deGPS is a powerful tool for detecting differential expression in RNA-sequencing studies

Overview of deGPS

To identify biologically important changes in RNA expression, we propose a more accurate
and sensitive two-step method for analyzing sequence count data from RNA-Seq experiments
(Fig. 1). Here, we implement our method in an R statistical package, termed “deGPS” (https://github.com/LL-LAB-MCW). To speed up permutation tests, deGPS also provides efficient parallel computation
using multi-core processors. In Step 1, two different methods based on the generalized Poisson (GP) distribution,
namely GP-Quantile and GP-Theta, were developed for normalizing sequence count data.
These two GP-based methods differ in parameter estimation and data transformation.
Generally, GP distributions fit sequence count data better than negative binomial (NB) distributions for
transcripts over a wide range of relative abundance in RNA-Seq experiments (Fig. 2). Other normalization methods commonly used for high-throughput data such as microarrays, including global, quantile [17], locally weighted least squares (Lowess) [18], and the trimmed mean of M-values (TMM) method [19], can also be adopted in deGPS.
The latter normalization methods are based on either linear scaling or sample quantiles
instead of modeling sequence count data. Normalization in Step 1 removes potential
technical artifacts arising from unintended noise, while maintaining the true differences
between biological samples.
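
For illustration, the non-GP normalizations listed above can be reproduced with standard Bioconductor functions; the sketch below uses a simulated count matrix and is not the deGPS internal code (GP-Quantile and GP-Theta are implemented inside the deGPS package itself):

  ## Minimal sketch of Step-1-style normalization with two of the non-GP methods
  library(edgeR)   # provides TMM scaling factors
  library(limma)   # provides quantile normalization

  set.seed(1)
  counts <- matrix(rnbinom(2000 * 10, mu = 50, size = 2), nrow = 2000,
                   dimnames = list(paste0("g", 1:2000), paste0("s", 1:10)))

  # TMM: estimate scaling factors, then take normalized log-CPM values
  dge     <- calcNormFactors(DGEList(counts = counts), method = "TMM")
  tmm_cpm <- cpm(dge, log = TRUE)

  # Quantile normalization applied to log-transformed counts
  quant_norm <- normalizeQuantiles(log2(counts + 1))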

Fig. 1. Overview of deGPS for analyzing sequence count data in RNA-Seq

Fig. 2. Modeling sequence read counts from RNA-Seq with the NB and GP distributions. a Read counts fitted by the NB and GP distributions and b QQ plots

After data normalization, DE detection is performed in Step 2. We employ the empirical
distribution of T-statistics to determine the p-values of DE tests. To obtain this empirical
distribution, we first randomly shuffle the samples between groups, then calculate
T-statistics in the permuted samples, and finally merge the T-statistics from all transcripts,
without any averaging, into one whole empirical distribution. The number of transcripts
analyzed in a typical RNA-Seq experiment is often large, ranging from hundreds to
tens of thousands. Using this sampling strategy, reliable empirical distributions can
be obtained even with small sample sizes. The permutation-based DE test in Step 2 is therefore robust
and powerful when sample size is small.
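
A conceptual sketch of this Step-2 permutation scheme is given below in base R; it is not the deGPS implementation (which additionally supports parallel computation), and the function and variable names are hypothetical:

  ## Pool permutation T-statistics from all transcripts into one empirical null
  ## distribution, then derive two-sided empirical p-values per transcript.
  perm_pvalues <- function(norm_data, group, n_perm = 100) {
    group  <- factor(group)
    t_stat <- function(x, g) {                 # ordinary two-sample T-statistic
      a <- x[, g == levels(g)[1], drop = FALSE]
      b <- x[, g == levels(g)[2], drop = FALSE]
      (rowMeans(a) - rowMeans(b)) /
        sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
    }
    observed  <- t_stat(norm_data, group)
    null_pool <- unlist(lapply(seq_len(n_perm), function(i) {
      t_stat(norm_data, sample(group))         # shuffle samples between groups
    }))
    sapply(observed, function(t0) mean(abs(null_pool) >= abs(t0), na.rm = TRUE))
  }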

Simulation strategies

To evaluate the performance of deGPS, we conducted comprehensive simulations under
a range of scenarios comparable to recent RNA-Seq studies. The advantages and disadvantages
of each tool are difficult to discern from any single small data set. Therefore, we
first simulated sequence count data from two large-scale RNA-Seq studies from The
Cancer Genome Atlas (TCGA), including 491 miR-Seq libraries (Additional file 1) and 100 mRNA-Seq libraries in human lung tumor tissues (Additional file 2).

To estimate type I error under the null hypothesis, we randomly sampled subjects
from our downloaded RNA-Seq datasets into two groups of 5 subjects each.
Type I error is defined as the proportion of transcripts with nominal p-values less
than 0.05 from statistical tests under the null hypothesis. To estimate FDR and true positive
rate (TPR) (i.e., statistical power) under alternative hypotheses, we first randomly
generated two groups of samples and randomly chose a subset of transcripts. Subsequently,
we made two types of changes in the selected transcripts to create DE between the two
groups. In the "shift" transformation, we added varied quantities of read counts (with variation
equal to one fifth of the added values) to the selected transcripts in
either group. In the "scaling shift" transformation, we multiplied the read
counts of the selected transcripts by varied quantities (with variation equal to one fifth
of the multiplied values) after applying the "shift" transformation (Additional file
3). In our deGPS method, nominal p-values were adjusted by the Benjamini-Hochberg procedure
[20]. FDR is defined as the proportion of transcripts identified by a statistical test
at a significance level of 0.05 (i.e., adjusted p-values ≤ 0.05) that are indeed
false discoveries (i.e., non-DE transcripts); TPR is defined as the proportion of
DE transcripts identified by a statistical test at a significance level of 0.05.
Each simulation was replicated 1,000 times.
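
The two transformations might be coded as in the sketch below, interpreting the stated variation as a standard deviation of one fifth of the added or multiplied value; the actual magnitudes used are given in Additional file 3, so the defaults here are placeholders:

  ## Create DE in a chosen subset of transcripts (de_idx) for one group's columns
  simulate_de <- function(counts, de_idx, grp_cols, shift = 20, scale = 1.5,
                          type = c("shift", "scaling_shift")) {
    type <- match.arg(type)
    out  <- counts
    n    <- length(de_idx) * length(grp_cols)
    # "shift": add read counts with sd equal to one fifth of the added value
    add  <- matrix(pmax(round(rnorm(n, mean = shift, sd = shift / 5)), 0),
                   nrow = length(de_idx))
    out[de_idx, grp_cols] <- out[de_idx, grp_cols] + add
    if (type == "scaling_shift") {
      # "scaling shift": then multiply by factors with sd of one fifth of the factor
      mult <- matrix(pmax(rnorm(n, mean = scale, sd = scale / 5), 0),
                     nrow = length(de_idx))
      out[de_idx, grp_cols] <- round(out[de_idx, grp_cols] * mult)
    }
    out
  }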

In the real data-driven simulations, sequence count data were normalized by GP-Theta
or GP-Quantile methods before applying our permutation-based DE tests. For the purpose
of comparison, we also included in the simulation four other normalization methods
(namely, Global, Lowess, Quantile, and TMM) that are not based on the GP distribution,
but are commonly used for high-throughput data such as microarrays [21, 22]. Our DE tests were then applied to the normalized data generated by all of these
methods (Additional file 4). We also chose four additional tools, edgeR (v3.6.7), DESeq (v1.16.0), DESeq2 (v1.4.5),
and SAMseq (v2.0), which are currently among the top performers for differential analysis
of sequence count data [16]. Prior to DE tests, edgeR performs TMM, relative log expression (RLE), or upper-quartile
normalization within its own R package [19]. DESeq and its variant (DESeq2) use a similar RLE approach for data normalization
by creating a virtual library that every sample is compared against [5]. Similarly, nominal p-values output from these R packages were adjusted by the Benjamini-Hochberg
(BH) procedure for evaluating FDR and TPR [20]. Note that edgeR has multiple user-defined parameter settings, while both DESeq and
DESeq2 were applied with default settings. We present the results from the most commonly
used TMM normalization with the glmLRT test (named edgeR1) and the glmQLF test (named edgeR2),
which generally perform better than the other settings (Additional file 3). SAMseq was run with default parameter settings.
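
For reference, the edgeR1 and edgeR2 pipelines correspond, up to version differences, to edgeR's TMM + glmLRT and TMM + glmQLF workflows, followed by BH adjustment as for all methods in the comparison; the sketch below assumes a raw count matrix `counts` with 5 samples per group:

  library(edgeR)

  group  <- factor(rep(c("A", "B"), each = 5))
  design <- model.matrix(~ group)

  dge <- DGEList(counts = counts, group = group)
  dge <- calcNormFactors(dge, method = "TMM")            # TMM normalization
  dge <- estimateDisp(dge, design)

  lrt <- glmLRT(glmFit(dge, design), coef = 2)           # edgeR1: NB GLM + LRT
  qlf <- glmQLFTest(glmQLFit(dge, design), coef = 2)     # edgeR2: quasi-likelihood F-test

  padj_lrt <- p.adjust(lrt$table$PValue, method = "BH")  # BH-adjusted p-values
  padj_qlf <- p.adjust(qlf$table$PValue, method = "BH")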

In addition to the above data-driven simulation strategy, we also used compcodeR for
benchmarking of DE analysis methods [23]. The compcodeR package provides functionality for simulating realistic RNA-Seq count
data sets and an interface for implementing several commonly used statistical methods,
such as DESeq and edgeR, for DE analysis. We set the proportion of upregulated transcripts
to 50 %, set the sample size to 5, 8, and 10 subjects per group, and introduced 0, 0.5,
1.0 and 2.0 % probabilities of random outliers to model abnormally high counts in RNA-Seq
studies. All other parameters were left at their defaults. compcodeR-based simulations were
replicated 100 times in each scenario. Type I error, FDR, TPR and AUC were evaluated
and compared using the functions provided in compcodeR. It is worth noting that compcodeR
simulates sequence count data from NB distributions, which potentially favors DESeq
and edgeR.
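
A single compcodeR scenario from the grid above could be generated roughly as follows; the argument names follow the compcodeR documentation and may differ slightly between versions, and the numbers of transcripts and DE transcripts are placeholders:

  library(compcodeR)

  sim <- generateSyntheticData(
    dataset                  = "benchmark_5spc_outlier1pct",  # hypothetical label
    n.vars                   = 12500,   # total transcripts (placeholder)
    samples.per.cond         = 5,       # 5, 8 or 10 subjects per group in our settings
    n.diffexp                = 1250,    # truly DE transcripts (placeholder)
    fraction.upregulated     = 0.5,     # 50 % of DE transcripts upregulated
    random.outlier.high.prob = 0.01,    # 0, 0.5, 1 or 2 % random high outliers
    output.file              = "benchmark_5spc_outlier1pct.rds"
  )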

To evaluate different FDR adjustment methods, we used the R package fdrtool [24] to further compare the BH method [20] with area-based FDR (QVAL) and density-based FDR (LFDR) in compcodeR-based simulations.
We took the permutation T-statistics, instead of p-values, from deGPS as the input of fdrtool
and extracted the QVAL and LFDR from its output. Since sample sizes in RNA-Seq experiments
are typically small, the estimated variances and their associated T-statistics used
in permutation tests are probably highly variable. We thus compared the ordinary T-statistic
to the regularized T-statistic in permutation tests for DE detection. The regularized T-statistic
was implemented in the R package st [25].
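
The comparison can be sketched as follows, assuming `t_stats` and `pvals` come from the permutation step above, `norm_data` is the normalized matrix and `group` the vector of group labels; feeding the T-statistics to fdrtool as approximately normal scores is an assumption here, and shrinkt.stat from st expects a samples-by-genes matrix:

  library(fdrtool)
  library(st)

  padj_bh <- p.adjust(pvals, method = "BH")              # BH-adjusted p-values
  fdr_out <- fdrtool(t_stats, statistic = "normal", plot = FALSE)
  qval    <- fdr_out$qval                                # area-based FDR (QVAL)
  lfdr    <- fdr_out$lfdr                                # density-based FDR (LFDR)

  # Regularized ("shrinkage") T-statistic as an alternative to the ordinary one;
  # note the transpose so that rows are samples and columns are transcripts
  t_reg <- shrinkt.stat(t(norm_data), group)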

Type I errors and false discovery rates

We first evaluated the type I error and FDR of different methods in datasets simulated
from the two large-scale RNA-Seq studies, including 491 miRNA and 100 mRNA TCGA samples
(Fig. 3). FDR quantifies the rate of false discoveries when multiple hypotheses are tested,
which is especially relevant in RNA-Seq experiments. Among these methods, only
three (GP-Theta, TMM and DESeq) precisely control both type I error and
FDR in both the miRNA and mRNA datasets. SAMseq has correct type I error and FDR in the miR-Seq
dataset, but inflates both in the mRNA-Seq dataset. DESeq is the most
conservative of these methods in terms of type I error and FDR. Its variant DESeq2
is less conservative but leads to higher FDR than expected. edgeR appears
unable to control type I error and FDR in all scenarios (Fig. 3 and Additional file 5).

Fig. 3. Type I error and false discovery rate. Data were simulated from large-scale TCGA lung
cancer sequencing studies, a miRNA and b mRNA. Two types of data transformation, "shift" and "scaling shift", were
applied. Boxplots summarize the type I error and false discovery rate of different statistical
methods for DE detection under a wide range of simulations. Methods in red font are
those that do not have correct type I error and/or false discovery rate

Six of these methods (GP-Theta, GP-Quantile, Global, Lowess, Quantile, and TMM) use
different strategies for data normalization but the same deGPS DE tests. They
nevertheless yield very different type I error and FDR. Only GP-Theta and TMM are able to control
both type I error and FDR at the desired level, whereas the other four methods have
inflated type I error and/or FDR. These results suggest that data normalization has a substantial
impact on the performance of DE tests in terms of type I error and FDR.

True positive rates

Next, we evaluated the TPR (i.e., statistical power) of different methods in these RNA-Seq
datasets (Fig. 4 and Additional file 6). The methods show different TPR across the RNA-Seq datasets. GP-Theta consistently
produces the highest TPR among the methods that also have correct type I error and
FDR in both the miRNA and mRNA datasets. SAMseq has roughly similar TPR to deGPS when
type I error and FDR are not taken into account. Compared with its original version
DESeq, DESeq2 has improved TPR, but at the cost of inflated type I error and FDR.
Generally, edgeR has high TPR but also exhibits high FDR.

Fig. 4. True positive rate. a miRNA and b mRNA. True positive rate (TPR) can be interpreted as statistical power

We also observed that data normalization dramatically influences the statistical power
of DE tests. Although the same DE tests were applied after data normalization, the six
different normalization methods result in varied TPR. Apart from GP-Theta, TMM performs
better than the other four methods.

Sensitivity and specificity

We compared deGPS with the other methods in terms of sensitivity and specificity in these
two RNA-Seq studies. We calculated the receiver operating characteristic (ROC)
curve and area under the curve (AUC) of different methods to measure their sensitivity
and specificity (Fig. 5 and Additional file 7). For clearer presentation, the AUC at false positive rates (FPR) less than 0.05
was calculated. In general, SAMseq, GP-Theta, DESeq2 and TMM are the top four performers
for DE analysis of sequence count data according to the AUC metric. Among the methods
that have correct type I error and FDR, GP-Theta performs the best as it has the largest
AUC. DESeq2 often has higher AUC than its original version DESeq. Generally, DESeq
and its variant DESeq2 perform better than edgeR in the mRNA datasets in terms of AUC,
whereas their performances are comparable in the miRNA datasets. Normalization methods
other than GP-Theta and TMM usually result in lower AUC.
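
One way to compute such a partial AUC is with the pROC package (our illustration; the exact implementation used for the comparison is not specified), restricting the specificity range to [0.95, 1], i.e. FPR < 0.05:

  library(pROC)

  # truth: 1 for truly DE transcripts, 0 otherwise; score: e.g. 1 - p-value
  roc_obj <- roc(response = truth, predictor = score, quiet = TRUE)
  pauc    <- auc(roc_obj,
                 partial.auc       = c(1, 0.95),     # specificity from 1 down to 0.95
                 partial.auc.focus = "specificity")  # equivalent to FPR < 0.05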

Fig. 5. Sensitivity and specificity. a miRNA and b mRNA. The AUC with false positive rate less than 0.05 was calculated. Boxplots summarize
AUC values from a wide range of simulation settings. TPR, true positive rate; FPR,
false positive rate

Benchmark data

We further compared deGPS with SAMseq, DESeq and edgeR using compcodeR, an R package
for benchmarking DE analysis methods, in particular methods developed
for analyzing RNA-Seq data [23]. In this analysis, deGPS with GP-Theta normalization, SAMseq, DESeq and DESeq2, and
edgeR1 and edgeR2 were evaluated on benchmark data (Fig. 6 and Additional file 8).

Fig. 6. Benchmark data from compcodeR. Type I error rate, FDR, TPR and AUC are evaluated under
0, 0.5, 1 and 2 % of outliers in RNA-Seq data. Sample size is 5 subjects per group

In the compcodeR-based simulations, both deGPS and SAMseq consistently control both type
I error and FDR and are robust against the occurrence of random outliers in RNA-Seq
experiments, whereas DESeq2 and edgeR1 are not able to control type I error and/or
FDR in most scenarios. DESeq is still conservative in terms of type I error, but
its ability to control FDR varies with the level of random outliers and the sample size.
edgeR2 generally performs much better than edgeR1 in terms of FDR control in the compcodeR-based
simulations. edgeR1 is based on a generalized linear model in which a regular likelihood
ratio test (LRT) is performed, whereas edgeR2 replaces the chi-square approximation
to the LRT statistic with a quasi-likelihood F-test [26].

In terms of TPR and AUC, SAMseq performs slightly better than deGPS, but the difference
between the two methods becomes small when the sample size increases from 5 to 8 subjects
per group (Fig. 6 and Additional file 8). edgeR2 performs similarly to deGPS in RNA-Seq data with no random outliers or a very
low proportion of outliers (i.e., 0.5 %). However, deGPS outperforms edgeR2 when
random outliers increase to 1 % or more. Interestingly, deGPS achieves
similar TPR under different levels of random outliers, suggesting that it is a robust approach
for DE analysis in the presence of abnormally high sequence read counts in particular
transcripts in RNA-Seq experiments. It should also be noted that both DESeq and edgeR
model sequence count data with the NB distribution, whereas deGPS is based on the GP distribution.
Therefore, the compcodeR benchmark analysis, which simulates sequence count data from NB
distributions, may favor DESeq and edgeR and thus overestimate their performance
relative to deGPS on real RNA-Seq data.

We also evaluated the effects of different FDR adjustment methods on the performance of
deGPS. The median FDR of QVAL or LFDR is slightly smaller than that of the BH method, although
the latter precisely controls both FDR and type I error. QVAL and LFDR do not always
outperform the BH method across repeated simulations in each scenario, as they show a much
larger interquartile range in the boxplots (Additional file 9). Why the performance of these three FDR adjustment
methods differs from case to case may warrant further investigation. Finally, we compared the ordinary T-statistic with the regularized
T-statistic in permutation tests for DE detection. The simulation results showed that,
on deGPS-transformed data, the ordinary T-statistic has slightly higher TPR and
is generally comparable to the regularized T-statistic in terms of type I error and
AUC (Additional file 9).

Real data analysis of the developmental transcriptome of Drosophila

In addition to the simulated datasets, we also analyzed the developmental transcriptome
of Drosophila melanogaster (Fig. 7 and Additional file 10) [27]. We compared six different methods (i.e., deGPS, SAMseq, DESeq, DESeq2, edgeR1 and
edgeR2) to identify genes that were differentially expressed between four developmental
stages of Drosophila: early embryo (0 to 12 h), late embryo (13 to
24 h), larval and adult stages. Each stage contains 6 RNA-Seq samples. The RNA-Seq
read counts of 14,869 genes from these 24 samples were downloaded from http://bowtie-bio.sourceforge.net/recount [28]. Prior to the analysis, we filtered out genes without any read counts across all samples
of the two compared groups. As in the simulations above, the BH procedure was
used to control FDR [20], and all genes found to be DE at an FDR threshold of 0.05 were considered significantly
DE. As expected, there were a large number of developmentally regulated DE genes in
early embryo development compared with adult Drosophila. Generally, edgeR1 and DESeq2
identified larger numbers of DE genes than the other methods, perhaps owing to
their failure to control FDR, as observed in the simulations. DESeq is the
most conservative and identified the smallest number of DE genes among these methods.
The edgeR methods show extremely high concordance: all of the DE genes identified
by edgeR2 were also identified by edgeR1. A similar observation holds for the DESeq methods:
about 99 % of the DE genes identified by DESeq were also identified by DESeq2. Approximately
70, 70 and 87 % of the DE genes found by deGPS overlap with those of SAMseq, edgeR1 and DESeq2,
respectively.
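
The pre-filtering and FDR thresholding described above amount to a few lines of R; the sketch below assumes `fly_counts` is the downloaded gene-by-sample count matrix and `stage` labels the 24 samples (both names are hypothetical):

  ## Keep only genes with at least one read across the samples of the two
  ## compared stages, then call DE at an FDR (BH) threshold of 0.05.
  keep_two_groups <- function(counts, stage, g1, g2) {
    sub <- counts[, stage %in% c(g1, g2)]
    sub[rowSums(sub) > 0, ]
  }

  early_vs_late <- keep_two_groups(fly_counts, stage, "early_embryo", "late_embryo")
  # pvals <- ... p-values from deGPS (or another method) on early_vs_late ...
  # sig_genes <- rownames(early_vs_late)[p.adjust(pvals, method = "BH") <= 0.05]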

Fig. 7. Analysis of the developmental transcriptome of Drosophila melanogaster. Four developmental
stages (early embryo, late embryo, larval and adult) were analyzed (Graveley, et
al., 2011). The numbers of genes differentially expressed between two adjacent stages
are presented at an FDR threshold of 0.05. The "overlap proportion" is calculated by
dividing the number of overlapping DE genes by the number of DE genes in the corresponding column

Next, we evaluated the ability of the above methods to control type I error and the number
of false positives. We randomly assigned equal numbers of subjects (without replacement)
from the same developmental stages into two groups of 5 subjects each. Each group contained
equal numbers of subjects from the same developmental stages and thus had similar gene
expression profiles. We therefore expected no genes to be truly DE when comparing
these two synthetic groups. Nevertheless, across 100 simulations, these methods identified
from 16 to 277 false-positive DE genes per genome scan. deGPS found the lowest
number of false positives, whereas edgeR1 found the highest.
edgeR1 inflates type I error, whereas the other four methods can control type I error
at the desired level (Additional file 11).