Replicate exome-sequencing in a multiple-generation family: improved interpretation of next-generation sequencing data


Direct comparisons of exome-capture samples in triplicate

We compared the results of independent exome-capture in triplicate for six individuals
from three generations (Fig. 1), evaluating them for high-quality bases (Q20) aligned to a reference human genome
(hg19). We compared the alignment results either between technical replicate samples
of a single individual or between individuals within the pedigree by utilizing pooled
data of all technical replicates for each individual. In total, 18 exome data sets
from six individuals were evaluated. Illumina HiSeq2000 sequencing generated 53–98
million paired-end (PE) 100-bp reads per technical replicate library to produce 48x
mean alignment target depth of high-quality mapped bases (Q20) for each technical
replicate library (Table 1). This generated 5–9 Gb of high-quality target-aligned data per technical replicate
and cumulatively 20–22 Gb (206–235 million PE 100-bp reads) to give 179 – 208x average
target coverage for each individual in the pedigree (Additional file 1: Table S1).

Fig. 1. Independent technical replicate target exome capture and aggregate data alignment
results for a six member multiple generation family. a Probability density function plots of three independent library targeted exome capture
experiments as a function of depth of sequencing per targeted base (X) categorized
by each member of the six member 3 generation family. b Mean depth of coverage as a function of total bases aligned in targeted exome region
(62 Mb) for all 18 technical replicates, and aggregate data for 6 individuals derived
by merging 3 technical replicate captures per individual. c Percent targeted bases sequenced at ?1x, ?10x, and ?20x thresholds as a function
of total number of bases aligned in targeted exome region (62 Mb) for technical replicate
and aggregate data for each individual. Black lines show the predicted local polynomial regression (loess) fit to data with default span value of 0.75, and red dashed lines represent predicted 95 % confidence interval along the predicted line

Table 1. Whole exome sequencing mean target coverage depth and percent target coverage statistics
of replicate and aggregate data

Technical replicate data were compared directly for each individual (Fig. 1a). The mean target-depth of sequencing varied linearly with the input of total sequence
data and was evident for all technical replicates derived from the six individuals
(Fig. 1b; Table 1). The most variable technical replicate depth of sequencing results was from individual
ID4385 (51x-86x) and the least variable was from individual ID3866 (69x-71x). Upon
aggregation of data from all technical replicates for each individual, the depths
of coverage were 179 – 208x for targeted regions (Fig. 1b).

To determine sequencability of targeted bases, we determined the percentage coverage
with ?1x to ?100x (increments of 10x) in each technical replicate and in each aggregate
data set (Additional file 1: Table S2 and Figure S2). This analysis identified minimal variability at ?1x coverage
but appreciable variability at ?20x coverage (81.2 – 89.6 %) among the technical replicates
of each individual (Fig. 1c). Greater variability was observed at higher (?30x to ?100x) depth of sequencing
thresholds (Additional file 1: Figure S2). Subsequently we restricted our analysis to ?1x, ?10x and ?20x thresholds.
We observed that this variability was a function of the total number of bases aligned
to target regions. Therefore, we used a local polynomial regression (loess) package in the R statistical software to estimate variation in percent target
region coverage as a function of sequenced bases aligned to target regions. We used
this tool to fit data for technical replicate and cumulative percent target region
sequenced (span?=?0.75). Using this approach, we predicted a polynomial fit to percent
target bases sequenced as a function of total bases aligned in targeted regions, and
determined the predicted 95 % confidence interval along the fitted line. Results showed
higher standard error at ?20x relative to ?1x (Fig. 1c). In addition, at ?1x we noticed that the fitted line approached saturation as a
function of total bases aligned to target. Taken together, this suggested that lower
thresholds (?1x) had lower variability, and ?20x threshold was highly sensitive to
changes in total bases aligned to target (Fig. 1c, especially when aligned data were below 10 Gb). Given these observations, we investigated
whether higher depth of sequencing would stabilize this effect at ?20x threshold and
repeated this analysis using the aggregate data for each individual. In addition to
sequencing 93.4 – 93.9 % of targeted bases at ?20x, we observed less influence of
input sequence data on the variability of percent target bases sequenced (see predicted
95 % confidence interval at higher depths). Overall, our results supported the conclusion
that current exome sequencing results (mean depth of 100x, 10 Gb aligned data) have
high variability at the ?20x coverage threshold.

Fig. 2. Estimation of stochastic variability between technical replicate targeted sequencing
experiments within the same individual. a Schematic representation of intersection–union test (IUT) on technical replicate
data generated independently in triplicate (R1, R2 and R3). The probability density
function was generated from technical replicate data of a single individual (ID3866)
with least variable input sequence data. The IUT is performed at preset thresholds
to test for low stochastic variability (H 0
) or the alternative hypothesis of high stochastic variability (H 1
) (b) Area-proportional Euler Venn diagram (eulerAPE v3.0) of targeted bases sequenced
in three technical replicates R1, R2 and R3 at ?20x. The square represents the total
targeted bases, and area in white as the total number of targeted bases not sequenced at a given threshold (?20x).
c Area-proportional Euler Venn diagram (eulerAPE v3.0) of targeted bases sequenced
in three technical replicates R1, R2 and R3 at ?1x. The square represents the total
targeted bases (ot proportional relative to the circles), and area in white as the total number of targeted bases not sequenced at a given threshold (?1x)

Stochastics in capture and sequencing can be estimated by replicate libraries

Data from exome sequencing are typically reported as percent targeted bases sequenced
at a given sequencing depth threshold. Although informative for the performance of
targeted sequencing as a whole, this masks the ‘true’ stochastic nature of per-target-base
coverage. In other words, it does not clarify whether a given targeted base achieves
the required minimum depth of sequencing if the capture experiment were to be repeated
independently. To address this, we analyzed the technical replicate data regarding
what fraction of total targeted nucleotides were subject to stochastic genotypeability
and sequencing given comparable, equal input sequence data (Fig. 2a). To investigate the relative stochastic variation in coverage at a per-target-base
level, we grouped the technical replicate samples by individual. Technical replicate
data for all individuals are shown in Additional file 1: Table S1. As proof of principle, we picked the set with the least variable sequence
input data (ID3866). The null expectation (H 0
) was that there would be no appreciable difference between the intersection and the
union of technical replicate sets (Fig. 3a). We observed that 52,212,644 bases (84.3 %) of 61,884,224 targeted bases were sequenced
at ?20x coverage in all three technical replicates (intersection), whereas 3,443,727
(5.5 %; P-value?=?0.0007; two-independent proportion test) targeted bases were sequenced
at ?20x coverage in at least one but not all three technical replicates (union) (Fig.
2b). Similar variation was observed for the three technical replicates of each of the
five other individuals (data not shown).

Fig. 3. Impact of deep sequencing as estimated by aggregate exome sequence data from replicates
in least variable individual. a Area-proportional Euler Venn diagram (eulerAPE v3.0) of targeted bases sequenced
by standard exome sequencing (regular) and aggregate exome sequencing. Sequenced data
are represented within circles of Venn diagram (black numbers), whereas targeted and missed by exome sequencing is represented by the square (red numbers). Left panel represents targeted bases in megabases (Mb), and right panel represents the results as percentage of total targeted bases. b Distribution analysis of number of consecutive targeted bases recovered by deep sequencing
to ?20x. Left panel is a log-log plot of frequency of consecutive targeted bases recovered. Right panel plots the distribution of total number of bases sequenced as a function of consecutive
targeted bases recovered by deep sequencing to ?20x. Red dashed lines represent 95 % confidence interval of loess predicted to the data (blue line). c UCSC genome browser screen shot example of LMNA exon that illustrates the variability of ?20x sequencing along the length of the
exon. The black arrow and red box highlight a known disease causing mutation (c.16C??T; p.Q6X) that is consistently
missed at the ?20x threshold by all three technical replicates, but addressed by aggregate
sequencing. Aggregate data covers the entire exon to ?20x

Because the above results for coverage at ?20x were theoretically dependent on sequencing
input quantity, we repeated the analysis at near ‘predicted’ saturation of capture
and sequencability (?1x coverage). Only 1.9 % of targeted bases were variable among
the technical replicates. Specifically, 58,994,725 bases (95 %) were sequenced at
?1x coverage, whereas 1,203,131 bases (1.9 %) were sequenced in at least one but not
all three technical replicates (not significant) (Fig. 2c). This suggested that stochastic variability among these technical replicates contributes
little to overall sequencability (?1x) although it had an appreciable affect on usable
(?20x coverage) sequence data.

Cumulative technical replicate sequencing improves targeted sequence interpretation

Of the variability within WES target capture regions, 2-3 % arose within protein-coding
regions at ?20x depth of sequencing threshold 26]. To understand whether or not deep sequencing addresses stochastic variability and
benefits achieve theoretical maximum coverage (95 % of targeted bases), we merged
the three technical replicate bam files from each subject to generate a single bam
file (Additional file 1: Table S1). For each aggregate data set, 20–22 Gb high quality reads covered targeted
WES regions; each targeted base had an average of 179 – 208x coverage. Compared to
81.2 – 89.6 % (50.2-55.4 Mb of 61.7 M) coverage of targeted bases at ?20x in the individual
technical replicates, the merged data covered 93.5 – 93.9 % of targeted bases (57.7
– 58.0 of 61.7 M) at ?20x, suggesting a 4.3 – 12.7 % improvement in coverage (Fig.
3a, Table 1). Each individual’s aggregate dataset recovered 3.4 – 6.4 million bases of variable
targeted region.

The distribution of consecutive targeted bases recovered to ?20x sequencing depth
followed a power-law distribution (Fig. 3b). Aggregate data recovered 117,913-170,667 singleton target-base positions (Fig. 3b); the average-size of consecutive bases recovered was ~50 bp (Fig. 3b). In each individual, we identified 17,132 – 40,726 segments greater than 50 bp.
We then intersected regions greater than 50 bp with UCSC known-gene protein-coding
exons. Intersecting UCSC known-gene coding bases with recovered regions revealed that
8,156 – 22,868 regions (0.5 – 2 Mb) overlapped protein-coding regions, including 12–16
of the 56 genes that the American College of Medical Genetics (ACMG) recommended for
return of incidental findings in clinical sequencing 25]. Figure 3c illustrates that current depths of sequencing consistently fail to meet 20x coverage
at clinically important sites. For example c.16C??T variant (p.Q6X; chr1: 156,084,725;
hg19) in LMNA, a cause of autosomal dominant Emery-Dreifuss muscular dystrophy (EMD) 27]. This illustrates how aggregate deep sequencing may help recover variable regions
to 20x or greater depth of coverage. Taken together, this analysis not only revealed
the advantage of technical replicate sequencing to determine exact targeted regions
affected by stochastics under current exome sequencing standards but also demonstrated
the utility of merging the technical replicate data to permit interpretation of regions
with coverage that is otherwise too shallow.

Genotyping sensitivity and accuracy to detect de novo variants improves with cumulative replicate sequencing

We investigated the effect of stochastic variation on genotyping of variants among
technical replicate data sets of the same individual. We included all 18 technical
replicates for this analysis. Since depth of sequencing and the relative proportion
of representation of the alternate allele play a key role in genotype calling 28], we delineated genotype discordances among technical replicates at varying depths
of sequencing. To address this question systematically, we binned each targeted position
into 10-19x, 20-29x, and ?30x bins. We evaluated sites within each of these bins where
genotype calls disagreed between technical replicates of the same individual. In total
we evaluated 19,424,806, 11,888,469, and 75,145,195 positions in the 10-19x, 20-29x,
and ?30x bins, respectively, and found 65 differences. 62 of these differences were
in the 10-19x bin (between technical replicate genotype discordance: 3.2×10
?6
), 3 in the 20-29x bin (between technical replicate genotype discordance: 2.5×10
?6
) and none in the ?30x bin.

To evaluate the accuracy of WES genotyping at the 65 genotype-discordant sites, we
performed Sanger dideoxy chain-termination sequencing. Of the 54 differences for which
we could design functional primers (Additional File 1: Table S3), Sanger sequencing showed that 20 (37 %) were heterozygous; 29 (54 %)
were homozygous reference, and 5 (9 %) were homozygous non-reference. This demonstrated
that 20 heterozygotes were not called in at least one technical replicate in an individual
(disagreement rate: 0.64×10
?6
; 20 in 31,313,275 sites tested), while 29 variant sites were falsely called as heterozygotes
in at least one technical replicate. The sequence surrounding 13 of the 29 sites mapped
to multiple regions of the human reference sequence suggesting that the differences
arose from mis-mapping (Additional file 1: Table S3).

Next, we called de novo variants that arose in second and third generations of the family. In technical replicate
data, for each trio, we found 3–11 (mean: 6.2?±?3) and 1–8 (mean: 3.3?±?2) de novos at 10x and 20x thresholds, respectively. In aggregated data, we found 4–7 (mean:
6?±?1.7) and 3–4 (mean: 4?±?2.3) de novos at 10x and 20x thresholds, respectively. Of the 30 sites for which we could design
effective primers, Sanger sequencing showed that 4 were true de novo variants in technical replicates (Additional file 1: Table S4). Two of the 6 de novo variants identified in the aggregate data were missed in technical replicate data.
The technical replicate and aggregate variants that did not validate by Sanger sequencing
were in regions with problematic GC-content or mappability (Additional file 1: Figure S5 and Table S5) 29]. Taken together, technical replicate data along with aggregate data for a given individual
improved the interpretation of NGS genotype calls compared to current WES standards.