Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly

Basic assembly metrics

With DNA from a pool of 38 females, we used DISCOVAR de novo to construct a de novo An. arabiensis assembly (Ddn-Anara). The resulting assembly had a contig N50 of 20,645 bp, and contained about
35 million bases not present in the contigs of the ALLPATHS-LG An. arabiensis reference assembly, AaraD1. (For consistency with the ALLPATHS-LG assembly, scaffolds
shorter than 1 kb and contigs shorter than 200 bp were excluded from calculation of
assembly statistics.) However, its total length was within a million bases of the
length of the AaraD1 scaffolds (Tables 2 and 3).

Table 2. Basic assembly statistics and information for contigs in three An. arabiensis assemblies

Table 3. Basic assembly statistics and information for scaffolds in three An. arabiensis assemblies

Aligning the Ddn-Anara “contigs” (see below) to the contigs of AaraD1 revealed that the Ddn-Anara contigs that did not align were some of the shortest contigs in the assembly
(Additional file 2). Short contigs (those less than 2 kb) were also disproportionately enriched for
repeats/low-complexity sequence (Additional file 3), so we removed them for all downstream analyses. Aligning this trimmed Ddn-Anara assembly to AaraD1 (Fig. 1) indicated that the trimmed assembly still retained nearly all the sequence of AaraD1.
The trimming process also screened for contamination by removing essentially all contigs
unique to Ddn-Anara with significant hits in nt. We further investigated the impact of the trimming
process to quantify the amount of genic sequence removed, and explore other possible
trimming lengths, evaluating assemblies trimmed at five lengths total. We found that
past the cutoff of 2 kb, the number of genes recovered in the assembly declined, as
did the percentage of each gene recovered (Additional file 4) After trimming at the 2 kb cutoff, the contig N50 of the Ddn-Anara assembly was 22,433.

Fig. 1. Alignment of trimmed Ddn-Anara assembly to the AaraD1 assembly. Nucmer alignment of Ddn-Anara contigs to AaraD1 contigs shows that nearly all sequence in the ALLPATHS-LG
assembly is retained in Ddn-Anara, even when contigs shorter than 2 kb are removed

The Ddn-Anara assembly was not formally scaffolded with long-range information from additional
libraries; instead, DISCOVAR de novo introduced 100-bp sequences of Ns to fill small gaps (see Additional files 5 and 6 for true size of filled gaps). Because the vast majority of these gaps were less
than 500 bp, and the total gapped length was less than 200 kb (Additional file 7), we treated the Ddn-Anara “scaffolds” as contigs for trimming and for all downstream analyses. (Calculated
basic assembly statistics, however, maintain the scaffold-contig distinction.) The
scaffold N50 of the (trimmed) Ddn-Anara assembly was 30,033 bp. Basic assembly metrics for the Ddn-Anara assemblies and AaraD1 are shown in Tables 2 and 3.

We also assessed the contiguity of the assemblies made with larger fractions of the
total reads sequenced, before and after removing contigs shorter than 2 kb (Additional
file 8); contig N50 for the assembly made from all sequenced reads was 32,261, nearly 10 kb
longer than that of the assembly made from one half of one lane’s worth of reads and
nearly half the contig N50 of the ALLPATHS-LG assembly. The scaffold N50 for this
assembly was 51,707 bp; while we did not repeat the analysis of gaps (see above) for
the assembly made with all the reads, if the pattern of the lowest-coverage assembly
holds, these “scaffolds” contain only short gaps and function more like contigs. These
basic assembly metrics suggest that significant increases in genome contiguity, if
required, could potentially be obtained by increasing sequencing depth.

Gene recovery

De novo genome assembly is often a prelude to gene annotation or other gene-based analyses.
To quantify the suitability of Ddn-Anara for gene-based approaches, we searched the assemblies for members of a set
of universal, single-copy arthropod genes (BUSCOs, 26]) in Ddn-Anara, the contigs and scaffolds of AaraD1, and a transcriptome previously assembled
with Trinity 17]. The BUSCOs were found with nearly equal completeness and contiguity in the contigs
of AaraD1 and Ddn-Anara, while BUSCOs found in the AaraD1 scaffolds were more complete and contiguous
than in any set of contigs. Nearly all the BUSCOs were present in the Trinity assembly,
but they were less complete, and more fragmented (Fig. 2a). Mean percent recovery from the AaraD1 contigs was 98.1 %, with each gene, on average,
in 1.004 pieces; for the AaraD1 scaffolds, an average of 99.0 % of each gene was recovered,
in 1.001 pieces. For Ddn-Anara, the corresponding numbers were 95.9 % and 1.014 pieces. Increasing sequencing
coverage resulted in modest improvements in the quality of gene models found in the
DISCOVAR de novo-built assemblies: on average, 96.3 % of each gene was recovered from the whole-lane
assembly, in 1.01 pieces. The assembly made with all reads recovered an average of
96.8 % of each gene, in 1.009 pieces.

Fig. 2. Analyzing the quality of 5 An. arabiensis genome or transcriptome assemblies in terms of gene content. We looked for three
categories of genes in the Trinity transcriptome, Ddn-Anara (labeled here as the 126× DISCOVAR de novo assembly), the higher-coverage DISCOVAR de novo assemblies, the ALLPATHS-LG AaraD1 contigs, and the ALLPATHS-LG AaraD1 scaffolds.
a Benchmarking universal single-copy orthologs (BUSCOs) from the An. gambiae PEST genome. b All PEST genes, excluding 5? and 3? untranslated regions (UTRs). The result of using
whole genes can be seen in the genes not recovered from the Trinity assembly. c Cuticle protein genes from low-complexity gene families identified in Cornman et
al. 2009

When using nucleotide sequences of all protein-coding genes (AgamP4.2) from An. gambiae, a member of the same complex as An. arabiensis (Fig. 2b), differences between the AaraD1 contigs and scaffolds diminished. Mean percent recovery
for AaraD1 contigs was 95.1 %, with each gene in an average of 1.50 pieces, while
mean percent recovery for the scaffolds was 95.2 %, with the same contiguity. Mean
percent recovery for Ddn-Anara was 94.1 %, with each gene in an average of 1.56 pieces. Increasing the number
of reads provided to DISCOVAR de novo again resulted in very modest improvements in completeness, but not contiguity; mean
percent recovery for the one-lane assembly was 94.4 %, with each gene in, on average,
1.56 pieces, and for the assembly made with all reads, the corresponding numbers were
94.7 % and 1.556 pieces. For both gene sets, Ddn-Anara closely approached not only the AaraD1 contigs but also the AaraD1 scaffolds
in contiguity and completeness.

Low-complexity sequence

As expected from their respective contig N50s and general levels of contiguity, there
are approximately twice as many instances of AaraD1 contigs spanning multiple Ddn-Anara contigs (1905 contigs) than the reverse (1066 contigs) (Fig. 3a). However, Ddn-Anara contigs spanning multiple adjacent AaraD1 contigs, though relatively short,
seemed to concentrate in the centromeric regions. In these same regions, which tend
to have high concentrations of low-complexity sequence, AaraD1 contigs spanning multiple
Ddn-Anara contigs occurred less frequently than in the rest of the genome. The sequence
found in the Ddn-Anara assembly between adjacent AaraD1 contigs was enriched for low-complexity sequence
compared to the rest of the genome (Additional file 3), suggesting that DISCOVAR de novo might have an advantage over ALLPATHS-LG in assembling low-complexity sequence. One
example of a Ddn-Anara contig spanning seven adjacent AaraD1 contigs is shown in Fig. 3b.

Fig. 3. Assessing regions where one assembler outperforms the other. aAn. gambiae chromosomal coordinates, and alignment length in that coordinate system, of AaraD1
contigs spanning multiple Ddn-Anara contigs (oriented downward) and Ddn-Anara contigs spanning multiple AaraD1 contigs (oriented upward). Centromeres are marked with black circles. Chromosomal inversion 2La, which is fixed in An. arabiensis but polymorphic in An. gambiae, is indicated by a grey box. b Close-up of a Ddn-Anara contig spanning the entirety of eight AaraD1 contigs, at approximately 24.1 Mb
on the X chromosome

We also saw this trend between the two assemblers when we focused on low-complexity
cuticle protein genes identified in 31] (Fig. 2c). As in the whole gene set, AaraD1 contigs recovered more of each low-complexity
gene than did Ddn-Anara (a mean of 98.4 % for AaraD1 contigs and scaffolds, and 95.8 % for Ddn-Anara); however, Ddn-Anara recovered the low-complexity genes in fewer pieces than AaraD1 (an average
of 1.96 pieces for AaraD1 contigs, 1.87 pieces for AaraD1 scaffolds, and 1.60 pieces
for Ddn-Anara). Together, these observations suggest that DISCOVAR de novo may have an advantage over ALLPATHS-LG in low-complexity regions.

We also examined performance in regions of low complexity in the assemblies made from
one and two lanes of data. The assembly made by DISCOVAR de novo from one lane recovered low-complexity genes with better contiguity and completeness
than any other assembly; an average of 98.4 % of each gene model was present in this
assembly, in an average of 1.52 pieces. (Corresponding numbers from the assembly made
with two lanes are 97.7 % and 1.57 pieces.) Increasing sequencing depth increased
the number of DISCOVAR de novo-generated contigs spanning multiple AaraD1 contigs, and decreased the number of AaraD1
contigs spanning multiple DISCOVAR de novo-generated contigs (Additional file 9). The degree of improvement in centromeric regions was rather stochastic, but was
most noticeable in the centromeric region of the X chromosome (Additional file 10).

Genome-wide patterns of polymorphism

We characterized variation in the heterozygosity of our template by mapping sequencing
reads back to the assembled DISCOVAR de novo contigs and calling variants. We found that while 39 % of the Ddn-Anara assembly
is represented by contigs exhibiting no variants, the other 61 % of assembly exhibits
a mean heterozygosity rate of 3.66 SNPs/kb, or 1 SNP every 273 bases. This rate of
heterozygosity is approximately four times as high as that in humans 32]. This profile of partitioned heterozygosity, a product of the inbreeding history
and pooled composition of the sequencing template, allows us to examine the performance
of DISCOVAR de novo in both monomorphic and polymorphic compartments of the An. arabiensis genome.

We also analyzed the variants predicted in the unflattened version of the assembly.
The largest category of “bubbles” present within the assembly, before flattening,
were 1 bp long, or traditional SNPs (Additional file 11). Contigs containing these bubbles, as well as contigs containing simple indels,
concentrated near the centromeres of the autosomes, but were found throughout the
genome (Fig. 4). To avoid overplotting, we identified and removed putative separately assembled
haplotypes, contigs that aligned very well to another, similarly-sized contig over
nearly all the length of both contigs. (While these contigs were shorter than the
set of all contigs as a whole (Additional file 12), they showed levels of repetitive sequence similar to the rest of the genome (Additional
file 3); this suggests that the similarity within each pair of contigs was not due to extended
regions of low complexity, which further suggests that these contigs represent separately
assembled haplotypes.) For many insect species, it may be necessary to pool samples
to obtain sufficient DNA for library preparation (generally a minimum of 0.5 ?g);
in those cases, this approach of analyzing haplotype data may be generalizable to
the identification and visualization of structural variation.

Fig. 4. Genome-wide polymorphism. Locations, in PEST coordinates, of Ddn-Anara contigs containing “bubbles” of exactly one SNP or exactly one indel. SNPs
are designated by points at the top of the plot; indels are designated by vertical lines, anchored at the x-axis, corresponding to the length of the indel. Both types of
variants are colored by the length in PEST coordinates of the contig on which they
are found. Centromeres are marked with black circles