De novo assembly of Dekkera bruxellensis: a multi technology approach using short and long-read sequencing and optical mapping


De novo assembly

A total of seven assemblies were generated using Illumina and PacBio sequencing data
(for a summary of computational resources used see Additional file 1: Table S4). We used these two data sets both in isolation and combined. To generate
assemblies from only Illumina reads, we used ALLPATHS-LG 52], ABySS 47], and SOAPdenovo 53]. For assembly of PacBio reads only, HGAP 8] and FALCON 54] were used. Illumina-PacBio hybrid assemblies were generated by AHA 55] and CABOG (using pacBioToCA error correction by Illumina reads) 56]. For assemblers using a De Bruijn Graph method with a mandatory k-mer size parameter
(ABySS and SOAPdenovo), we tested a range of k when running SOAPdenovo, and found
k?=?61 to be optimal (see Additional file 1).

We computed standard contiguity metrics (Table 1) for all assemblies. Table 1 shows that ALLPATHS-LG gave the most well connected Illumina assembly, i.e., greater
N50 and fewer but longer contigs. In comparison, the ABySS assembly had the lowest
N50 number and more numerous but shorter contigs. In terms of N50, the SOAPdenovo
assembly can be regarded as being better connected than the ABySS assembly; however,
a large majority of the assembly consists of contigs less than 1 kbp in length. When
considering PacBio only assemblies, the most connected assembly is the one produced
by HGAP, which has an N50 four times shorter than that produced by ALLPATHS-LG. FALCON
performed noticeably worse than HGAP, with a much lower assembly length (see Table 1) and a lower N50. However, FALCON is experimental and might not be suitable for the
input data, and/or it was used with non-optimal parameters. AHA fared best among the
hybrid-assemblies.

Table 1. Standard contiguity metrics

In the absence of a reference sequence, it is difficult, if not impossible, to determine
the assembly that is most representative for the underlying genome based on the standard
contiguity metrics alone. We ran CEGMA on all assemblies to evaluate their gene space
(see Fig. 5 and section below for more details). However, CEGMA only helped us to identify SOAPdenovo,
FALCON, and AHA as outliers. The remaining five assemblies contained a similar number
of core genes. We decided to use FRC analysis to evaluate our assemblies, used in
a similar way to that used for the Norway spruce genome 7] and GAM-NGS studies 57]. The cumulative feature curves (Fig. 2) confirmed the poor performance of the less connected assemblies produced by ABySS
and FALCON. FRC did, however, overturn the contiguity metrics for most connected assemblies:
ALLPATHS-LG and HGAP. FRC also reshaped the order of PacBio assemblers pacBioToCa
and HGAP. ALLPATHS-LG was not only the best Illumina assembler, but also generated
the assembly with fewest features, i.e., areas of suspected mis-assembly. However,
Fig. 2 shows that HGAP was able to cover more of the genome while introducing fewer features.
Clearly, the long ALLPATHS-LG contigs accumulate more features than the shorter HGAP
contigs, e.g., with 2000 features we were able to cover more than 60 % of HGAP assembly
but ‘only’ 50 % of that assembled by ALLPATH-LG. This might suggest that the long
ALLPATH-LG contigs are the result of a too-eager assembly strategy (see Fig. 3 and Additional file 1: Figure S2). Remarkably AHA, one of the better connected assemblies, performed much
worse than pacBioToCA because of it had a high number of compressed repeat features
(Additional file 1: Figure S3).

Fig. 2. Feature response curves. Feature response curves (FRC) for assemblies considered for
optical map placement. On the x-axis is the total number of features normalised for
the assembly contig count. On the y-axis is the coverage based on the estimated genome
size of 14,719,721 bp (size of the first completed assembly, HGAP)

Fig. 3. Placement of ap_contig1 to optical map Chromosome 1. An illustration re-drawn from
the output of the OpGen’s Mapsolver software, where in silico digested allpaths-lg
contigs are placed to the optical map Chromosome 1. It shows a complex rearrangement
where flaws in the allpaths-lg assembly are corrected. The 1.38 Mbp region A´ of ap_contig1 is a collapsed repeat structure, which the optical map was able to
resolve and subsequently could be placed to regions a 1
and a 2
of Chromosome 1. This map placement is highlighted in transparent red for clarity
and shows that the sequences were placed in inversed orientation. Furthermore, a 2
and a 3
are flanking the placed sequence b 1
, originating from the B region of the contig ap_contig6. On the left flank of B is an unplaced region whose restriction enzyme cuts could not be aligned to the cuts
made by the Argus system, and is likely the result of mis-assembly

After a careful analysis of contiguity metrics, CEGMA hits, FRC curves, and coverage
plots (Additional file 1: Figure S4 and S5) automatically produced by the NouGAT, we deemed ALLPATHS-LG and
HGAP to have produced the best assemblies. Consequently, we chose them for optical
map placement.

Optical map placement

From the OpGen imaging and data processing steps, seven optical maps were obtained,
named Chromosome 1, Chromosome 2, etc., spanning about 16.79 Mbp in total. This is
an impressive result compared with the 308 and 351 unordered contigs generated by
HGAP and ALLPATHS-LG, and with this critical information we were able to both spatially
resolve the D. bruxellensis genome and to error correct de novo assembled contigs. Using OpGen’s MapSolver software
to digest in silico assembled sequences and placement on optical maps, we devised
the following strategy: first cover the maps using ALLPATHS-LG contigs of minimum
40 kbp length (shorter fragments cannot be placed as they do not have enough in silico
restriction enzyme cuts), and then fill in any remaining gaps using HGAP assembled
contigs. Using this method we were able to cover 87 % with contigs, with the remaining
unplaced ALLPATHS-LG contigs included as ‘unknown’ sequences.

An interesting feature of note is represented in Fig. 3. In this figure we can clearly appreciate the potentiality of optical mapping when
it comes to finishing and error correcting draft assemblies. Chromosome 1 has been
assembled to a single restriction map using optical mapping. The figure represents
a complex repeat structure, shown schematically as three sequences labelled a
1
, a
2
, b
1
, and a
3
, with a
2
and a
3
containing an identical repeat the size of approximately 434 kbp. Thanks to the longer
fragment lengths utilised by this method, a complex repeat structure has been resolved
(contained in regions a
2
and a
3
). Neither ALLPATHS-LG nor HGAP (i.e., neither Illumina nor PacBio) alone has been able to correctly reconstruct such a
complex scenario. HGAP resulted in 13 small contigs partially covering regions a 2
and a 3
, one of which is placed in both (see Additional file 1: Figure S2). ALLPATHS-LG has been able to produce an extremely long contig, likely
using the information inferred from the longest mate-pair library. However, Fig. 3 clearly demonstrates that the long contig, ap_contig1, is the result of wrong decisions
made during scaffolding; not only that a complex repeat is collapsed to a single copy,
but a 545 kbp region is absent and placed in a different contig (region B of ap_contig6). This scenario clearly shows the additional value added by optical
maps and the importance of being mindful when presented with long contigs generated
from relatively short DNA fragments.

To represent the haploid genome (in the style of a reference genome), we had concerns
about the maps for Chromosomes 7, 6 and later 5, since all the ALLPATHS-LG contigs
placed therein were duplicates of ones found in the first four maps. The maps for
chr5–7 were considerably smaller in size than those preceeding. Furthermore, Mapsolver
showed large map-to-map alignments between these two groups (chr1–4 to chr5–7), which
strongly suggests that these regions are recombinations.

To test how well chr5–7 are supported by the sequencing data we generated two map-placed
consensus sequences: one consisting of sequences for chr1–7 and another of sequences
chr1–4. These were processed by the assembly evaluation pipeline, and the feature
response curves (Fig. 4) clearly indicated that the assembly for chr1–4 is the best performing assembly,
which it owes mainly to the reduction of low coverage regions when the Illumina reads
are mapped. It also becomes obvious that chr1–4 is able to cover more of the genome
than HGAP (the best performing assembly), while introducing fewer features: approximately
4900 in chr1–4 compared with 5800 in HGAP.

Fig. 4. Total and low coverage feature response curves. The total feature response curves
(a) only shown for HGAP, allpaths, chr1–7 and chr1–4. The decreased number of features
when removing Chromosomes 7, 6 and 5 is mostly attributed to regions of low read coverage
(b)

Validation using CEGMA

As an extra validation step we ran CEGMA 58], which maps the assembled sequences to a set of 458 highly conserved eukaryotic genes.
For the 248 most extremely conserved genes, alignments to the queried assembly are
classified as ‘complete’ or ‘partial’ depending on a fixed alignment length threshold.
Of the total number of CEGMA hits, allpaths and HGAP performed equally with 246 hits
of which one is a partial hit. While the results from CEGMA were not, in our case,
essential to the evaluation of the assemblies (over 95 % completion for most assemblies,
Additional file 1: Table S1), two observations are remarkable. First, FALCON and abyss, which we earlier
established as ‘poor’, are reflected in these results by having a lower completion
rate. Second, the final Dekkera assembly (chr1–4) received a total of 240 hits, of which three are partial hits (Fig. 5) retaining most of the core genes in an ordered and oriented manner. Further evidence
of chr5–7 being artifacts of mis-assembly is the fact that excluding these did not
reduce the total number of hits, only a partial loss of one hit. This can also be
seen by the higher percentage of orthologous hits in chr5–7 (Additional file 1: Table S1).

Fig. 5. Reported CEGMA gene hits. Barchart showing the number of hits to a set of 248 extremely
conserved eukaryotic genes, as reported by CEGMA. Classified as either ‘complete’
or ‘partial’, depending on the alignment percentage

Genome completion using PacBio

We carefully investigated the proportion of optical maps that is assembled exclusively
by HGAP. In other words, we wanted to check what we gain by combining Illumina and
PacBio assemblies. HGAP contigs were able to add 487 kbp of new sequences, which ALLPATHS-LG
was not able to reconstruct. Moreover, 363 kbp out of 532 kbp of ambiguous sequences
(gaps and ambiguous base calls) could be replaced using the sequencing information
from HGAP contigs. In total, the PacBio data allowed us to resolve slightly more than
5 % (Additional file 1: Table S3) of additional genomic content. We believe that, when automated, this presents
an effective strategy for genome finishing.