An evaluation of statistical methods for DNA methylation microarray data analysis

Simulation study

We conducted simulation studies to compare the power and stability of six DNA methylation
differential analysis methods for independent and correlated DNA methylation levels
across CpG loci. Each simulation study included 1,000 independently generated two
group samples with sample size (n) of 3, 6, 12, or 24 in both the cancer and normal groups. For all simulations, we
set the total number of DNA methylation loci (m) as 1000. The fractions of truly differentially methylated loci were set at 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % to cover different scenarios.
To mimic the data distribution of a real DNA methylation array experiment, the ? values from the DNA methylation array studies for both cancer and normal groups were
generated from a mixed beta distribution (0.1B e t a(0.5,5)+0.9B e t a(5,0.5)), for independent DNA methylation levels across CpG loci. For correlated DNA
methylation levels across CpG loci, the ? values y ij
for ith locus and jth subject were generated from where b ij
and e ij
were the (i,j)th elements of m×2n matrixes of B and E respectively. B was simulated from a B e t a(0.1,0.1) distribution and E was an matrix of error following a factor-analytic structure E=L U T
+?25]. L=Z×? in which Z (a m×4 factor loadings matrix) denoted methylation profiles for constituent cell types
and ?=d i a g(0.55,0.35,0.07,0.03)
T
(a 4×4 factor scores diagonal matrix) denoted cell proportions through its diagonal
elements. U was a 2n×4 matrix of latent effects, and ? was a m×2n random error matrix. Both Z and U were simulated from N(0,1) distribution. ? were simulated from N(0,? i
) with (i.e. the standard deviation of each value was , but the errors were correlated across CpG loci). To set up the mean ? value differences between groups, all ? values of 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % of 1000 CpG loci in normal group
subtracted a sequential vector from 0.1 to 0.4 with a length of 10, 50, 100, 250,
500, 750, and 900. For instance, with 1 % true difference between groups, the first
10 rows of ? values from the normal group will equal the original ? values in the normal group, generated either from the mixed beta distribution 0.1B e t a(0.5,5)+0.9B e t a(5,0.5) or from , subtracted the sequential vector with a length of 10 from 0.1 to 0.4, i.e. (0.10,0.13,0.17,0.20,0.23,0.27,0.30,0.33,0.37,0.40).
The M-values were generated using the l o g i t2 transformation of the ?-values and the FDR level was set at 5 %.

Simulation results

Independent cases

For simulated DNA methylation data with sample sizes as small as 3 in each group,
all methods could control FDR at a desired level of 5 % (Fig. 1a and Fig. 2a). In terms of power, the empirical Bayes method was the most powerful, followed by
the bump hunting method and the t-test when the proportion of differentially methylated loci was below 50 % (Table
2 and Table 3). The bump hunting method was the most powerful method, followed by the empirical
Bayes method and the t-test when the proportion of differentially methylated loci
was above 50 %. Within the bump hunting method, power is higher with Storey’s q-value procedure than with Benjamini and Hochberg’s procedure. Neither the Wilcoxon
rank sum test, the Kolmogorov-Smirnov test, nor the permutation test had power to
identify any truly differentially methylated locus across all proportions of differentially
methylated loci tested. For stability, the empirical Bayes method was much better
than either the t-test or the bump hunting method (Table 4 and Table 5). The bump hunting method had the largest standard deviation of total discoveries
once p-values were adjusted using Storey’s q-value procedure. The standard deviation of total discoveries from the bump hunting
method increased exponentially as the proportions of differentially methylated loci
increased. In the simulation studies, no significant differences were observed between
? values and M values in terms of FDR control, power, mean number of total discoveries, or standard
deviation of total discoveries.

Fig. 1. Estimated FDRs, powers, means of total discoveries, and SD of total discoveries of
six DNA methylation differential analysis methods using ? values (Independent case). Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Fig. 2. Estimated FDRs, powers, means of total discoveries, and SD of total discoveries of
six DNA methylation differential analysis methods using M values (Independent case). Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Table 2. Powers across six DNA methylation differential analysis methods using ? values for independent case

Table 3. Powers across six DNA methylation differential analysis methods using M values for independent case

Table 4. Standard deviation of total discoveries across six DNA methylation differential analysis
methods using ? values for independent case

Table 5. Standard deviation of total discoveries across six DNA methylation differentially
analysis methods using M values for independent case

Increasing sample size to 6 in each group, the Wilcoxon rank sum test, the Kolmogorov-Smirnov
test, and the permutation test all showed greater than zero power (Fig. 1b and Fig. 2b). While all methods could control FDR at 5 %, the empirical Bayes method remained
the most powerful among all methods, followed by the bump hunting method and the t-test, when the proportion of differentially methylated loci was below 25 % (Table
2 and Table 3). The bump hunting method was the most powerful method, followed by the empirical
Bayes method and the t-test/permutation test, whenever the proportion of differentially
methylated loci was above 25 %. The power of the Wilcoxon rank sum test, the Kolmogorov-Smirnov
test, and the permutation test was lower than the t-test whenever the proportion of differentially methylated loci was below 25 %; however,
the power of the permutation test was similar to the t-test, whenever the proportion of differentially methylated loci was above 25 %. The
Wilcoxon rank sum test and the Kolmogorov-Smirnov test had relatively lower power
compared to the other methods, even after the proportion of differentially methylated
loci increased to 25 % or higher. In terms of stability, the bump hunting method had
the largest standard deviation of total discoveries and an exponentially increasing
trend, especially when the proportion of differentially methylated loci was larger
than 50 % (Table 4 and Table 5). All other methods showed similar stability, while the empirical Bayes method had
relatively the smallest standard deviation of total discoveries across all proportions
of differentially methylated loci. Significant differences were not observed between
? values and M-values in terms of power, mean number of total discoveries, and standard deviation
of total discoveries, except that the FDR was controlled at a lower level whenever
the ? values were used for the empirical Bayes method.

For a moderate sample size of 12 in each group, power was not significantly different
across methods whenever the proportion of differentially methylated loci was greater
than 1 % (Fig. 1c and Fig. 2c). The mean number of total discoveries was also similar. standard deviation of total
discoveries was maintained at a relatively low level for all methods across all proportions
of differentially methylated loci, except for the bump hunting method, which showed
a relatively large standard deviation of total discoveries and an exponentially increasing
trend, whenever the proportion of differentially methylated loci was above 25 % (Table
4 and Table 5). All methods still controlled FDR within a 5 % level and had a more conservative
control of FDR as the proportion of differentially methylated loci increased, with
the exception of the bump hunting method which was less conservative in FDR control
as the proportion of differentially methylated loci increased. Aside from the fact
that the FDR was controlled at a lower level whenever the ? values were used for the empirical Bayes method, no significant differences were
observed between ? values and M-values in terms of power, mean number of total discoveries, and standard deviation
of total discoveries.

Similar simulation results were observed when sample size was increased to 24 in each
group (Fig. 1d and Fig. 2d). The power of all methods became almost identical, and the large standard deviation
of the bump hunting method became more obvious. Whenever ? values were used for analysis using the empirical Bayes method, the FDR was controlled
at a relatively lower level as compared to using M values. No significant power or stability differences were observed between the ? values and M values.

Overall, the power and stability of all methods increased as sample size increased
for both ? values and M values (Table 2, 3, 4 and 5). It was observed that the permutation method retained lower power whenever the proportion
of differentially methylated loci was as low as 1 %, regardless of sample size. The
Wilcoxon rank sum test and the Kolmogorov-Smirnov test had exactly the same power
and stability whenever either ? values or M values were used.

Correlated cases

When methylation levels were correlated across CpG loci and sample size was as small
as 3 in each group, the FDR and power estimates were different from independent cases.
The t-test and empirical Bayes method had very large FDR estimates with both ? values and M values. The bump hunting method had estimated FDR exceeding 0.05 when ? values were used, but the FDR was well controlled at 5 % when M values were used (Fig. 3a and Fig. 4a). Interestingly, the bump hunting method had much higher power than all other methods
especially when the proportion of differentially methylated loci was lower than 25
%. We also noticed that the power of the bump hunting method was higher when using
? values than when using M values (Table 6 and Table 7). The bump hunting method also had a larger mean of total discoveries than all other
methods, and identified more loci when ? values were used. The stability trend remained the same as in the independent case.
The bump hunting method still had the lowest stability among all methods compared
(Table 8 and Table 9).

Fig. 3. Estimated FDRs, powers, means of total discoveries, and SD of total discoveries of
six DNA methylation differential analysis methods using ? values (Correlated case). Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Fig. 4. Estimated FDRs, powers, means of total discoveries, and variances of total discoveries
of six DNA methylation differential analysis methods using M values (Correlated case). Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Table 6. Powers across six DNA methylation differential analysis methods using ? values for correlated case

Table 7. Powers across six DNA methylation differential analysis methods using M values for correlated case

Table 8. Standard deviation of total discoveries across six DNA methylation differential analysis
methods using ? values for correlated case

Table 9. Standard deviation of total discoveries across six DNA methylation differentially
analysis methods using M values for correlated case

When increasing sample size from 3 to 6 in each group, the FDR and power estimates
also showed different characteristics from independent cases. The t-test still had very large FDR estimates when either ? values or M values were used. The empirical Bayes method had decent control of FDR when ? values were used; however, it lost control of FDR when M values were used (Fig. 3b and Fig. 4b). When M values were used, the bump hunting method had the highest power among all methods
compared. When ? values were used, the bump hunting method was the most powerful method, followed
by the empirical Bayes method and the t-test whenever the proportion of differentially methylated loci was lower than 10
% or higher than 75 %, while the empirical Bayes method had the highest power whenever
the proportion of differentially methylated loci was between 10 % and 75 % (Table
6 and Table 7). For stability, the bump hunting method had still the largest standard deviation
of total discoveries whenever the proportion of differentially methylated loci was
high, either with ? values or M values (Table 8 and Table 9).

For a sample size of 12 in each group, the power and stability of all methods started
to converge. The Wilcoxon rank sum test, the t-test, and the Kolmogorov-Smirnov test had estimated FDR values larger than 0.05 whenever
? values were used and the proportion of differentially methylated loci was smaller
than 10 %. When M values were used, only the bump hunting method and the permutation test had FDR controlled
at 5 % whenever the proportion of differentially methylated loci was smaller than
10 % (Fig. 3c and Fig. 4c). With ? values, the empirical Bayes method had slightly higher power than all other methods
whenever the proportion of differentially methylated loci was smaller than 25 %. The
bump hunting method had the highest power whenever the proportion of differentially
methylated loci was greater than 25 %. Using M values, the permutation test had the highest power whenever the proportion of differentially
methylated loci was smaller than 50 %, and the bump hunting method had the highest
power whenever the proportion of differentially methylated loci was greater than 50
% (Table 6 and Table 7). The stability of all methods began to converge, but the bump hunting method still
showed slightly larger standard deviation than all other methods compared (Table 8 and Table 9).

With a sample size of 24 in each group, the power of all methods was similar (Table
6 and Table 7). The t-test had estimated FDR over 5 % whenever ? values were used and the proportion of differentially methylated loci was smaller
than 10 %. All methods had estimated FDR within 5 % when M values were used (Fig. 3d and Fig. 4d). The bump hunting method with Storey’s q-value adjustment still showed low stability whenever the proportion of differentially
methylated loci was large (Table 8 and Table 9).

In summary, the power and stability of all methods showed differences when using ? values versus M values in all correlated cases (Table 6, 7, 8 and 9). Whenever sample sizes were 3, 6, or 12 in each group, the t-test, the empirical Bayes method, and the bump hunting method had larger power using
? values than M values. The same observation was made whenever sample size was increased to 24 in
each group with the proportion of differentially methylated loci smaller than 25 %.
The Wilcoxon rank sum test and the Kolmogorov-Smirnov test had similar power using
? values or M values, and the permutation test had higher power using M values than ? values. The permutation method still retained low power whenever the proportion of
differentially methylated loci was as low as 1 %, regardless of sample size. All methods
were observed to produce slightly larger standard deviations whenever using ? values rather than M values, except for the bump hunting method and the empirical Bayes method whenever
the proportion of differentially methylated loci was larger than 50 % for sample size
of 12 in each group.

Real data examples

Ovarian cancer

Ovarian cancer ranks fifth in cancer death among women in the United States 26]. Aberrant DNA methylation was found to be associated with ovarian cancer. A genome
wide DNA methylation profiling of United Kingdom Ovarian Cancer Population Study (UKOPS)
was conducted to identify methylation signatures associated with carcinogenesis 23]. The data is available publicly, downloaded from the NCBI GEO website with GEO number
GSE19711. The data originated from the Illumina Infinium 27k Human DNA methylation
Beadchip v1.2 with 27578 CpGs from 540 whole blood samples, and 266 samples were taken
from post-menopausal ovarian cancer patients, and 274 from normal controls (age-matched).
To illustrate the differences in apparent test power (total number of discoveries)
across the six methods at different FDR levels, we randomly selected either 3, 6,
or 12 samples from both the cancer pre-treatment group and control group. The FDR
levels ranged from 0.01 to 0.10, with a step of 0.01. Due to lack of significants
using adjusted p-values for all methods, the raw p-values were used for comparisons. Thus, the Storey’s q-value procedure and the Benjamini-Hochberg procedure from the bump hunting method
had the same raw p-values.

When we randomly took 3 samples from both the cancer and control groups, both the
empirical Bayes method and the bump hunting method showed higher apparent test power
than the four other methods (Fig. 5). No discoveries were made either with the Wilcoxon rank sum test, the Kolmogorov-Smirnov
test, or the permutation test, below a FDR level of 0.08. The t-test had lower apparent test power than the empirical Bayes method and the bump hunting
method. However, differences between the empirical Bayes method and the bump hunting
method were not significant. When we randomly selected 6 samples from both groups,
total discoveries were similar for all methods, except for the Kolmogorov-Smirnov
test which had still a relatively lower apparent test power than all other methods.
No significant differences were observed between results using either ?-values or M-values. When increasing sample size further to 12 in each group, we observed that
all methods had more convergent apparent test power than when using a sample size
of 6 in each group. Similarly, the Kolmogorov-Smirnov test showed increased power
as sample size increased to 12 in each group.

Fig. 5. Total discoveries of the six DNA methylation differential analysis methods using both
? and M values for methylation 27k data. Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Rheumatoid arthritis

Rheumatoid arthritis is a complex disease whose etiology involves the interaction
of genetic, environmental, and life-style factors 27]. Epigenome-wide associations study have implicated DNA methylation as an intermediary
of genetic risk in rheumatoid arthritis using Illumina HumanMethylation450 arrays
on 354 rheumatoid arthritis cases and 337 controls 24]. The Methylation data was downloaded from the NCBI GEO website with accession number
GSE42861. To demonstrate further the differences in apparent test power across the
six methods at different FDR levels for popular HumanMethylation450 arrays, we randomly
selected samples of size 3, 6, or 12 from both the rheumatoid arthritis case group
and control group with the same FDR level set in the Ovarian Cancer example. The Storey’s
q-value procedure and the Benjamini-Hochberg procedure from the bump hunting method
had the same raw p-values.

The apparent test power showed similar results to those observed in the ovarian cancer
example (Fig. 6). When sample size was 3 in each group, the empirical Bayes method and the bump hunting
method had higher apparent test power than all other methods compared. The empirical
Bayes method had a slightly higher apparent test power than the bump hunting method
when ? values were used, while the bump hunting method had a slighter higher apparent test
power than the empirical Bayes method when M-values were used. All other methods showed similar results to those observed in the
ovarian cancer example. When sample size was further increased to 6 or 12 in each
group, the apparent test power of all methods compared were similar.

Fig. 6. Total discoveries of the six DNA methylation differential analysis methods using both
? and M values for methylation 450k data. Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical
Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting
q-value adjustment (BH-q)

Overall, the results of the apparent test power comparisons of the six DNA methylation
differential analysis methods using real data were consistent with our simulation
results.