Effects of subsampling on characteristics of RNA-seq data from triple-negative breast cancer patients


The purpose of our study was to learn about the influence of the sequencing depth
on inferred biological results. For this reason, we investigated 4 layers of complexity.
First, we compared differences between an explicit subsampling of reads and a direct
scaling of count matrices. The results from this analysis demonstrated that a subsampling
via DESIRE was necessary to obtain realistic surrogates of sequencing experiments
with a smaller sequencing depth. Second, we studied the absolute expression of genes
and their growth. Third, we investigated the growth rate of the number of expressed
genes. Fourth, we analyzed differences in the distributional shape of expressed genes
between TNBC patients and TNBC-free patients. For each of these analysis steps, we
used data generated by the DESIRE procedure.

Differences between subsampling of reads and direct scaling of count matrices

Our first analysis investigated differences between a subsampling of reads via the
DESIRE procedure and a direct scaling of count matrices. The results of this analysis
justified our approach for the following sections.

The basic idea of DESIRE is to draw randomly aligned reads, as provided by a Sam file,
and create a new auxiliary Sam file corresponding to a new sequencing experiment with
a smaller sequencing depth. We compared this with a direct scaling of count matrices,
whereas the scaling was obtained by multiplying the components of the count matrices,
c ij
, with a constant factor f that corresponds to the simulated sequencing depth because

(4)

Hence, this simple scaling of a count matrix resulted in the desired simulated sequencing
depth for a sample.

For one TNBC-free sample (SRR1313211), the difference between counts obtained via
our DESIRE procedure and the direct scaling method of count matrices is shown in Fig. 4. Specifically, the number of expressed genes (Y axis), depending on the sequencing depth f (X axis), for different values of a threshold parameter is presented in Fig. 4. By the number of expressed genes, we meant the number of genes that have a short
read count c ij
of ??{1, 10, 50, 100} or larger, i.e., c ij
 ? ?, where ? is the threshold parameter. All results are for raw count values, not
normalized values, and each dot corresponds to the result from one data set.

Fig. 4. Comparison of the subsampling of reads via the DESIRE procedure (blue) and a direct scaling of count matrices (red). The obtained numbers of expressed genes depending on the sequencing depth for four
different threshold parameters (1, 10, 50, 100) are shown.

For all threshold values and all sequencing depths that we investigated, there were
distinct differences between the two approaches (Fig. 4). Similar results were also observed in other patient samples (not shown). From these
results, we concluded that the computationally efficient shortcut via a direct scaling
of count matrices did not lead to the same results as the DESIRE procedure. Hence,
the scaled count matrices did not correspond to sequencing experiments with a smaller
sequencing depth but had an unclear biological interpretation. For this reason, the
DESIRE procedure needs to be used for simulating realistic sequencing experiments
because only in this way do the resulting data have a clear interpretation in biological
terms. In the following sections, we used the DESIRE procedure for this purpose.

We would like to note that neither our statistic, the number of expressed genes, nor
the specific threshold ? was crucial for our conclusion, but other statistics led
to similar results. For our following analysis, it was important only that there was
a difference but not how each individual measure was affected. However, we thought
that for particular measures that were used, e.g., as test statistic for hypothesis
tests or distance metrics for clustering, it might be interesting to quantify these
differences more specifically.

Absolute expression of genes

In this analysis, we studied the influence of the sequencing depth on the number of
expressed genes. The results for a TNBC-free patient (SRR1313211) and a TNBC patient
(SRR1313133), exemplary for all samples studied, are shown in Fig. 5; the number of expressed genes (Y axis), depending on the sequencing depth f (X axis) for different values of a threshold parameter ??{1, 10, 50, 100}, are also
presented. All results are for raw count values, not normalized values, and for each
sequencing depth f, we generated R = 24 subsampled data sets for which box plots are shown.

Fig. 5. Triple-negative breast cancer (TNBC)-free sample SRR1313211 and TNBC sample SRR1313133.
The number of expressed genes for different filtering thresholds (1, 10, 50, 100)
depends on the sequencing depth. The blue curves correspond to fitted Gompertz functions. All results are for raw (unnormalized) count
values.

The first impression of the overall behavior was intuitive because the larger was
the sequencing depth, the higher was the probability to obtain at least ? reads for
a gene, if it was expressed. Less intuitive was the fact that for all samples and
all thresholds, there was no saturation in the number of expressed genes, but this
number continued to grow, which suggests that even the maximally available sequencing
depth was not sufficient to achieve a saturation of the measurements. In addition,
this pointed to possible errors in either the sequencing or the alignment of reads
because it was biologically implausible to assume that almost all 23,648 genes considered
by our analysis were actually expressed for ? = 1 (Fig. 5). This may open the possibility to quantify such errors statistically.

From the obtained results in Fig. 5 and the results from 6 further samples that looked qualitatively similar (not shown),
we attempted to estimate the optimal sequencing depth in the following two ways using
the available sequencing depth of the samples used for our analysis (TNBC samples:
34974017, 46677107, 17574408, and 24440340; TNBC-free samples: 25900791, 43454785,
31426867, and 33517581). Estimator (I)—average sequencing depth: the first estimator
centers on average properties of our samples. Given that the average number of short
reads per sample was 32,245,737 ± 9,710,593 (averaged over the 8 samples) and the
fact that none of the growth curves saturated, we estimated that the average number
of reads necessary for a saturation must be larger than 32,245,737. Estimator (II)—individual
sequencing depth: the second estimator centers on the individual samples. The largest
sequencing depth of our samples was 46,677,107, and even this sample did not lead
to a saturating growth. Hence, a conservative estimate requires an individual sequencing
depth larger than 46,677,107.

The variability of all results, e.g., the interquartile range (IQR) of the box plots,
was in general quite small. However, for larger ? values, the IQR was even further
decreased, which showed that the estimation for the number of expressed genes was
even more stable for larger expression threshold values, corresponding to a more stringent
filtering for expressed genes.

For a quantitative comparison between the TNBC and TNBC-free patient samples, we compared
the mean of median values of the number of expressed genes, for different sequencing
depths f, to test the null hypothesis:

(5)

by a two-sample t test. Each comparison was based on 4 samples per condition. Here, for instance, median
TNBC|f
indicates the conditional median value of TNBC patients, conditioned on the sequencing
depth f. The other conditional symbols have a similar meaning.

The results of these hypothesis tests are shown in Table 1. For a significance level of ? = 0.05, only one result for a left-sided test was
significant for f = 0.1. However, all other P values from the left-sided comparison were approximately 5%, indicating a tendency
of being different but not significantly. This is plausible because we know that the
samples from TNBC and TNBC-free patients corresponded to two different physiological
conditions but that these differences affected some, but not all, biological processes,
e.g., the hallmarks of cancer 33]. Hence, if samples are compared as a whole, as in our case, using only the mean of
the medians of the number of expressed genes as a test statistic and not adjusting
for different types of biological processes, e.g., using information from the gene
ontology database 34], this signal is too weak to be detected. On the other hand, we found that the number
of expressed genes in TNBC patients was smaller than that in TNBC-free patients because
there was a clear asymmetry between the left- and right-sided P values, always leading to the relation

(6)

Table 1. Results of two-sample t tests comparing the total number of expressed genes for various sequencing depths

This relation indicated that, on average, there were fewer genes expressed in TNBC
patients than in the corresponding control samples, independent of the sequencing
depth.

Growth rate of the number of expressed genes

Next, we compared the growth of the number of expressed genes depending on the sequencing
depth (Fig. 5). For this reason, we fitted Gompertz growth functions 35] given by

(7)

Here a, b, and c are parameters of the Gompertz function to be fitted and c is called the growth rate. For our quantitative comparison, we used the fitted values of c.

We used Gompertz growth functions because the number of (expressed) genes of an organism
was limited and, hence, so was the increase in the number of genes having more than
a certain threshold needed to saturate. Growth curves, such as the Gompertz function
or the logistic function 36], 37], have the natural constraint of being limited from above and, hence, provide a natural
choice for a constrained regression function. Table 2 shows the growth rates and their standard deviations for all 8 samples and the 4
threshold values, ?? {1, 10, 50, 100}.

Table 2. Fitted growth factor values and standard deviations for the Gompertz functions

From a visual inspection, there were only slight differences between the different
conditions. For this reason, we quantified the results to test the null hypothesis
that there was no difference in the values of the growth rates, i.e.,

(8)

for depth by a two-sample t test. Again, each comparison was based on 4 samples per condition.

To identify direction-specific effects, we also performed hypothesis tests for two-sided,
left-sided, and right-sided comparisons. The results of these hypothesis tests are
shown in Table 3. Overall, for a significance level of ? = 0.05, none of these hypothesis tests was significant. However, the right-sided
P values were not much larger than 0.05, hinting at a tendency in the data to be different,
like the comparison of the median number of expressed genes above.

Table 3. Results from comparing the growth rates of the fitted Gompertz functions for TNBC
and TNBC-free patients

A normalization of the data does not remove the growth property observed in Fig. 5, but normalized data exhibit qualitatively the same behavior. For ? = 1, this was
obvious because the normalization led to a scaling of the data without changing the
zero values. For ?  1, it was less intuitive but followed from our numerical analysis
(results not shown).

Distributional shape of expressed genes

Last, we studied the distributional shape of expressed gene values (and not of their
numbers) by estimating individually for each parameter configuration its mean value,
variance, skewness, and kurtosis. Here, we mean the distribution over all genes within
a sample, and not the count distribution of individual genes across samples. Because
every distribution with existing moments was fully characterized by all of its moments,
either via its moment-generating function or via its probability generating function
38], 39], our analysis was an approximation of the distributional shape because we limited
our focus to 4 dimensions.

Specifically, for each condition (TNBC versus TNBC-free) and each sequencing depth
(f ? {1, 10, 50, 100}), we generated R = 24 data sets, giving a total of 432 data sets, and applied the expression threshold
? = 1 to each data set as a filter. In the following analysis, we distinguished between
CPM normalized and raw (unnormalized) data by estimating the mean, variance, skewness,
and kurtosis of the distributions of expression values of the genes. The results of
this analysis are shown in Fig. 6 and Tables 4 and 5, which include results for raw (unnormalized) data in Columns 3 and 4. The first
observation from Fig. 6 is that a normalization of the data was absolutely necessary to obtain stable results
across different sequencing depths. This is clearly visible for the mean and variance
values because they showed increasing values for larger sequencing depths. In this
respect, even a simple CPM normalization counterbalanced this effect, leading to stable
expression patterns across different sequencing depths. This also illustrated that
the choice of normalization method affected the statistical properties of a distribution
and the results of statistical inference significantly, such as differential gene
expression analysis, which was also observed 15]. From a visual comparison of the moments for TNBC and TNBC-free patients, we observed
clear differences between the variance, less clear differences for the kurtosis and
neutral differences for the mean and skewness. For a quantification of the comparison
between the moments for TNBC and TNBC-free patients, we tested the following null
hypothesis by a two-sample t test:

(9)

for depth f and m?{mean, variance, skewness, kurtosis}, indicating the four moments we studied. Each
comparison was based on nine samples per condition because we pooled the median values
across the different sequencing depths for each condition and each measure m. The results of this analysis are shown in Table 6. Overall, the mean values were essentially undistinguishable (with P values of approximately 1.0) but the other three moments were significantly different
at a two-sided significance level of ? = 0.05. Specifically, for the kurtosis and
skewness, the left-sided tests were significant; for the variance, the right-sided
test was significant. That means that for kurtosis and skewness, the values of the
moments were higher in TNBC-free patients than in TNBC patients, whereas for variance,
these values were lower. This result is interesting because, commonly, a disease is
associated with instability or disorder, but a decreasing variance suggested less
variability in the expression values of the genes.

Fig. 6. Results for the 4 moments: mean, variance, skewness, and kurtosis (rows). Columns 1 and 2: normalized data; Columns 3 and 4: raw data; Columns 1 and 3: TNBC patients; Columns 2 and 4: TNBC-free patients.

Table 4. Moments for TNBC-free patients

Table 5. Moments for TNBC patients

Table 6. Results from pooled (across different sequencing depths) two-sample t tests for the 4 moments of the gene expression distributions