Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications


In this section, two different simulation studies were conducted to investigate the
performance of the proposed confident difference criterion methods on identifying
DE genes for microarray or sequence-based studies, respectively. In addition, a real
affymetrix dataset is used to further demonstrate the proposed methodology.

Simulation study I: Microarray data

Two settings were considered. In the first setting, the intensity values having different
means and variances between two biological conditions on each DE gene are simulated.
In the second setting, the data were simulated from three biological conditions, with
DE genes having different mean and variance values between at least two biological
conditions.

Setting 1 (Two conditions)

Fifty simulations were used in this study to investigate the performance of different
versions of the confident difference criterion methods described in the Confident
difference criterion section. In each simulation, there were 5000 genes in total and
500 DE genes with 10 replications under each of the two biological groups. The log-scaled
data were generated via with ? g
=1,?g=1,…,250 and ? g
=?1,?g=251,…,500 for the DE genes, and for the remaining EE genes. The average intensities ? g
were generated from an uniform distribution, where for all genes. Conditionally conjugate priors described in the Model for microarray
data subsection were used for all parameters ? g
, ? g
, , and .

The simulated data were analyzed using both Methods I and II of the confident difference
criteroin methods. For each version, the cutoff value ?0
were set to be 0.4, 0.6, or the cutoff value controlling the FDR to be no more than
0.05, separately. The genes with the calculated posterior distribution values ? g
via Equation (1) or Equation (3) less than the chosen ?0
were identified to be DE. To evaluate the performance of the confident difference
criterion methods, the simulated datasets were also analyzed by four existing methods:
Significant Analysis of Microarrays (SAM) 33], Linear Models for Microarray Data (LIMMA) 28], Semiparametric Hierarchical Method (SPH) 23], Empirical Bayesian Analysis of Gene Expression Data (EBarrays or EBA) 14]. All these existing methods allowed a control of the FDR for multiple comparisons.
The genes were declared to be DE with FDR controlled at 0.05 for all these four methods.

Based on the identified gene list by each method, we calculated the number of claimed
DE genes (CDE), the number of correctly claimed DE genes (CCDE), the number of correctly
claimed EE genes (CCEE), the false positive rate (FPR), false negative rate (FNR),
false discovery rate (FDR) and false non-discovery rate (FNDR) for all considered
methods. These results and their standard deviations reported in parentheses were
summarized in Table 1. Note, for Methods I and II, the choice of ?0
=0.4 identified the a larger number of DE genes when compared to the choice of ?0
=0.6. While for the case with FDR control of 0.05, the Method II identified the largest
number of DE genes among all six methods. We also compared the results of the confident
difference criterion method with a control of FDR against all four existing methods.
We expected a method with good performance will provide a good control of FDR and
provide smaller error rates in terms of FPR, FNR and FNDR. Under both versions of
the confident difference criterion method, the achieved FDR is close to but less than
0.05, implying that the proposed confident difference criterion methods provided a
good control of the FDR. All four existing methods also obtained a control of FDR
at 0.05 successfully, although the SPH method provides a conservative control of FDR
with the empirical FDR equal to 0.02. Since all methods provided small error rates
of FPR and FNDR, we put more weight to the comparison of the empirical FNRs among
all applied methods. The results in Table 1 showed that Method II provided the smallest empirical FNR out of all methods by successfully
identifying almost all truly DE genes; and Method I had comparable empirical FNR as
the SAM and the LIMMA methods, and much smaller empirical FNR when compared to the
SPH and the EBA methods.

Table 1. Performance evaluation under Study I (Setting 1), (G=5000, 500 DE gene)
#

Setting 2 (Three conditions)

The data were simulated from three biological conditions, and the first biological
condition was considered as the reference group. A gene was set to be DE so that at
least one group would be either up or down regulated from the reference group. Specifically,
500 DE genes out of 5000 genes were set in the data, and the log intensities of the
DE genes were generated via . Depending on whether the gene was set to be DE in one or both conditions from reference
group, the parameters ? g1
and ? g2
were set to have ? g1
=? g2
=1.5 for g=1,?,62 (up-regulated in both conditions); ? g1
=1.5, ? g2
=0 for g=63,?,125 (only up-regulated in condition 2); ? g1
=1.5, ? g2
=?1.5 for g=126,?,187 (up-regulated in condition 2, down-regulated in condition 3); ? g1
=0 and ? g2
=1.5 for g=188,?,250 (only up-regulated in condition 3); ? g1
=0, ? g2
=?1.5 for g=251,?,312 (only down-regulated in condition 3); ? g1
=?1.5, ? g2
=1.5 for g=313,?, 375 (down-regulated in condition 2, up-regulated in condition 3); ? g1
=?1.5, ? g2
=0 for g=376, ?,437 (only down-regulated in condition 2); ? g1
=? g2
=?1.5, for g=438,?,500 (down-regulated in both conditions). The remaining genes were EE and their
log intensities were generated via for t=1,2,3 and g=501,?,5000. On all genes, the parameter ? g1
were generated from an uniform distribution, i.e., . Each condition contained 10 replicates on each gene and 50 simulations were conducted.

The model similar to those described in Model for microarray data subsection and the
proposed confident difference criterion methods including the Method I for comparing
mean expressions and the Method II comparing both mean and variance expressions were
applied to the simulated data. We considered three choices for the cutoff value ?0
, including prespecified values 0.4, 0.6, or a value with FDR controlled at 0.05,
separately. The data were also analyzed by the SAM 33], LIMMA 28], and EBArrays 14] with the FDR controlled at 0.05. The SPH 23] were not used in the study as they were proposed for studies with two biological
conditions only. The analytical results from all methods were compared based on four
error rates including FPR, FNR, FDR and FNDR from each considered method (Table 2). The confident difference criterion methods including Method I and Method II as
well as the existing methods except LIMMA all provided an empirical FDR no more than
0.05 successfully. Comparing to the existing methods, the proposed confident difference
criterion methods provided comparable FPR and smaller FNR and FNDR. Method II of the
confident difference criterion method compares both mean and variance values of the
gene expression intensities across different biological conditions. This is a potential
reason for the proposed method providing smaller FNR for microarray data analysis.
The confident difference criterion method is particularly effective when both mean
and variance of the expression intensities differ across biological conditions on
the DE genes.

Table 2. Performance evaluation under Study I (Setting 2), (G=5000, 500 DE gene)
#

Simulation Study II: sequence-based data

The focus of this study is to investigate the performance of the proposed confident
difference criterion method for identifying DE genes from sequence-based high-throughput
experiments including SAGE and RNA-Seq studies.

Setting 1 (SAGE experiment)

The simulation proposed by Lu et al. 20] was used to conduct the simulation study. Specifically, 5000 genes were sampled under
five libraries from each of the two conditions with fixed library sizes sampled uniformly
between 30000 and 90000. A total of 500 genes were set to be DE genes. The data were
generated from a negative binomial distribution, for gene g, for a fixed library k of condition t, where m tk
was the library size for library k under condition t; ?1
and ?2
denoted the dispersion parameters for data from the two conditions separately, and
both set to be 0.4; ? gt
measured the expression level of gene g under condition t and were set with different values when gene g is DE and the same value when gene g is EE. For g=1,?,250, we set ? g1
=8E?4 and ? g2
=2E?4 to include down-regulated genes in condition 2. For g=251,?,500, we set ? g1
=2E?4 and ? g2
=8E?4 to include up-regulated genes in condition 2. For other genes with g=501,?,5000, we set ? g1
=? g2
=2E?4 to include EE genes. Fifty simulations were used in this study.

The proposed confident difference criterion method for RNA-Seq data was used to analyze
the simulated data. The posterior probability ? g
measuring the evidence of differential gene expression were estimated using average
value of its posterior sampled values. The cutoff value ?0
for ? g
were set to be 0.4, 0.6 or a value to control the FDR to be 0.05, separately. The
genes with estimated ? g
less than the chosen ?0
value were claimed to be DE. We also fit several other existing methods including
edgeR 26], DESeq 2], BaySeq 10], NBPSeq 8], EBSeq 17], NOISeq 32], SAMSeq 19], and TSPM 3]. When the edgeR method was applied, we chose both options to estimate the common
dispersion parameter for all tags and the tag-wise dispersion parameters respectively.
For the NOISeq method, we estimated and controlled the FDR using the method proposed
by Newton et al. 23] for identifying DE genes.

The results using the proposed confident difference criterion methods and all fitted
existing methods for RNA-Seq data were summarized in Table 3. Similar to the simulation study I for microarray data, Table 3 showed that the higher the cutoff value ?0
, the less number of genes were identified to be DE. The confident difference criterion
method with a control of FDR at 0.05 achieved an empirical FDR of 0.044, and successfully
identified 328.8 genes (65.8 %) on average out of 500 truly DE genes. Compared to
other considered methods, the confident difference criterion method performed the
best by providing the smallest FNR and FNDR while maintaining comparable FPR and a
well controlled FDR. Out of the applied existing methods, the NOISeq method and edgeR
method achieved the lowest FNR, and a FDR of no more than 0.05. The BaySeq method
provided a conservative control of FDR, and achieved an empirical FDR of lower than
0.001 when controlling the FDR at 0.05. The DESeq, EBSeq and TSPM methods failed to
control the FDR at 0.05. The SAMSeq method and TSPM method failed to identify most
of the truly DE genes as DE genes, which was not surprising as the performance of
both the SAMSeq and TSPM methods is highly sample size dependent as pointed out by
Soneson and Delorenzi (2013) 29].

Table 3. Performance evaluation under Study II (Setting 1), (G=5000, 500 DE gene)
#

Setting 2 (RNA-Seq experiment)

We used a similar simulation setting proposed by Kvam et al. 16] for illustrating the application of the proposed confident difference criterion method
for RNA-Seq experiment. We still simulated 50 dataset, each dataset contained six
libraries with three libraries from each of the two conditions on 5000 genes, among
which 250 genes were set to be up-regulated genes and another 250 genes were set to
be down-regulated genes in condition 2 versus condition 1. The overall mean expression
levels across both conditions were generated from a gamma distribution with . To avoid including genes with low expression levels from both conditions as DE genes,
we set the difference in the gene expression levels between conditions in two ways
depending on whether the value of ? g
is larger than one. Specifically, we generated ? g
from uniform distribution for each gene. If the value of ? g
1, we let the fold change between the gene expression values of DE genes to be ? g
, or and for up-regulated genes, and and for down-regulated genes. If the value of ? g
?1, we let the absolute difference in the gene expression values to be ? g
, or we let ? g1
=? g
+? g
and ? g2
=? g
for down-regulated genes, and ? g1
=? g
and ? g2
=? g
+? g
for up-regulated genes in condition 2. For an EE gene, we had ? g1
=? g2
=? g
.

Then we generated the data using negative binomial distribution of for gene g, and the overdispersion parameters ?1
and ?2
were set to have ?1
= 1 and ?2
= 8 respectively for DE genes; and ?1
=?2
=4 for EE genes.

All methods applied in setting I of simulation study II were also used for data analysis
in this simulation study. The results in Table 4 displayed that the confident difference criterion method with a control of FDR at
0.05, the edgeR method with common dispersion parameter over genes, the edgeR with
gene-wise dispersion parameter, the BaySeq, the NBPSeq, the NOISeq methods successfully
controlled the FDR at 0.05. Additionally the confident difference criterion method,
the NBPSeq method, the edgeR method with a common dispersion parameter over genes
also provided a good and comparable control of FNR of less than 0.2, while maintaining
low levels of FPR and FNDR.

Table 4. Performance evaluation under Study II (Setting II), (G=5000, 500 DE gene)
#

Real data analysis

We used a real data set obtained using customized Bovine Affymetrix arrays (Davis,
Talbott, Yu, and Cupp, unpublished results) to illustrate the proposed method. Fifteen
arrays composed of three replicate arrays under three biological conditions were produced
to screen for DE genes associated with prostaglandin F2?(PGF) treatment after 30 min, 1 h, 2 h, and 4 h compared to the control treatment
(saline). For simplicity, we focused on detecting genes using the confident difference
criterion methods (Method I and Method II) that were regulated 1 h or 2 h after PGF
treatment. The data were extracted, normalized and summarized using the Robust Multi-array
Average (RMA) 12] method at the exon level via the Affymetrix expression console. The data set contains
21724 genes. Note that some genes may have multiple probe replicates ranging from
one replicate to 266 replicates, and the data from different probes of the same gene
may have large variation even after RMA normalization. We centered the data from each
probe of the same gene to the mean log intensities of that gene, and excluded 3116
genes with only a single probe replicate from the analysis to make sure that the parameters
were estimable. Additionally, we excluded 2137 low expression genes if two-thirds
or more (six out of nine) samples on this gene had gene expression values measured
by the geometric mean expression values across different probes less than 10. Of the
remaining 16471 genes with replicate probes, we used z gjtk
to denote the k t h
biological replicate sample of the log2 scale gene expression intensity for probe
j of gene g under condition t. Note that the index j was added to the previous notations for the log intensity values as data are available
for multiple probes on the same gene. We assumed normal distribution for the log2
intensities with , and the same prior for ? gtk
as what we set for Xgt
in the Model for microarray data subsection. The variance parameters are assumed to
follow inverse gamma distribution with with and . We set and where both ??
=??
=1. During computation for controlling the FDR, we reuse these settings of the prior
distributions on the parameters ? gtk
and for DE genes. For EE gens, we assume that , and make similar augment for the prior distributions of their parameters ? gtk
and as the DE genes. The proposed confident difference criterion methods were applied
to assess the evidence of differential expression on each gene and identify DE genes
with the cutoff value equal to be 0.4, 0.6 or a value that controls the FDR at 0.05.

In addition, we analyzed the real data using the existing methods including SAM, LIMMA,
and EBarrays as described in the Simulation Studies section for identification of
DE genes. Since the existing methods were developed for data with single probe replicate
on each gene, we calculated the mean log intensities over all probes for each biological
sample on each gene to quantify the corresponding gene expression. The genes were
declared to be DE if the false discovery rate was no more than 0.05. We used Venn
diagrams to demonstrate the overlap of DE genes identified by Method I (Fig. 2, Left Panel) or Method II (Fig. 2, Right Panel), to the DE genes identified by SAM and EBarrays (Fig. 2). The results showed that more genes were identified to be DE by the proposed Method
I and Method II than the existing methods. Specifically, 1050 DE genes were identified
by Method II, while 896 genes were identified to be DE by either SAM or EBarrays.
Of note 340 out of 353 DE genes identified by LIMMA were also identified by SAM (data
not shown), and 951 of 991 DE genes identified by Method I were also identified as
DE by Method II. We found that SAM identified 375 DE genes, all of which were also
identified by other methods. For example, 358 (95.5 %) genes identified by SAM were
also identified by Method I or II; and 342 (91.2 %) genes identified by SAM were also
identified by EBarrays method. The EBarrays method identified 863 DE genes, out of
which, 643 (74.5 %) genes were also identified by Method I or II. Method I identified
116 of the 324 genes identified by LIMMA when comparing all four time points versus
control in the same dataset, while Method II called 105 out of 387 genes DE that were
also called DE by LIMMA within the whole dataset. In addition, many genes identified
to be DE only by Method II not by Method I show a linear trend among the average gene
expression across conditions observed from samples collected with longer time after
treatment, and larger data variations under the control condition than those observed
at other time points after treatment. For example, the average log2 gene expression
of THBS1 increased from 9.22 under control condition to 10.35 at 2h after treatment,
and the standard deviation equaled 0.88 under the control condition, and 0.37 at 2
h after treatment. This gene was only detected to be DE by Method II and was shown
to play roles in angiogenesis 37].

Fig. 2. Number of identified DE genes out of 16471 genes from real data analysis. Two venndiagrams
present the overlapping among the DE genes identified separately by Method I/II, SAM,
and EBarrays with the false discovery rate controlled at 0.05 from the real data

The genes identified solely by Method I or Method II were analyzed by Ingenuity Pathway
Analysis (IPA, Build version: 313398M, Content version: 18841524 (Release Date: 2014-06-24)
to determine biological functions and pathways associated with the newly identified
genes. Genes identified solely by Method I and not by SAM or EBarrays were analyzed
by IPA which identified several major canonical pathways such as hepatic fibrosis
/ hepatic stellate cell activation, glucocortiocoid receptor signaling, agranulocyte
adhesion and diapedesis, and role of IL-17A in arthritis (Additional file 2: Table S1). Many of the canonical pathways identified have either established or
potential roles in corpus luteum function indicating that Method I identified DE genes
that are biologically relevant within the model. Method I also identified IL1B (P=2.12E?08) and TNF (P=3.03E?08) as upstream regulators of the genes found exclusively by Method I, which also
fits with known and suspected mechanisms of PGF action within the corpus luteum 1], 24].

Genes identified solely by Method II were also analyzed by IPA which identified canonical
pathways such as hepatic fibrosis/hepatic stellate cell activation 21], glucocorticoid receptor signaling, IL-8 signaling, and granulocyte adhesion and
diapedesis. Upstream regulators of gene found solely by Method II included: IL1B (P=4.56E?13), TGFB1 (P=1.19E?11), and IFNG (P=1.82E?11). The IPA results both concur with current literature and offer new insights into
the possible mechanism(s) of action of PGF in the corpus luteum 1], 9], 11], 21]. These and similar canonical and regulatory functions were also identified when the
complete dataset (30 min, 1 h, 2 h, and 4 h) was analyzed by IPA. These network functions
are in agreement with the known or suspected changes in biological function in the
corpus luteum following PGF treatment in several species 1], 5], 22], 27]. Several of the genes identified by Methods I and II are known to be involved in
regulation of the fate of the corpus luteum after PGF treatment, and were also identified
as DE genes in our larger data set and a similar microarray dataset examining the
effects of PGF in the cow 22]. For example, genes that code for chemokines (e.g., CCL3 and CCL8) and prostaglandin
synthesis (e.g., PTGS2) were found to be significantly up-regulated at 1 and 2 h using
Methods I and II which were not identified using LIMMA. However, CCL3, CCL8, and PTGS2
were all identified as significantly up-regulated in later time points using LIMMA,
which conservatively identifies DE genes. Therefore, it seems possible that Methods
I and II may provide a more sensitive approach to identify the temporal patterns of
gene expression.