An efficient approach to BAC based assembly of complex genomes

Determination of the optimal sequencing depth for BAC assembly

To determine the level of sequence coverage for accurate BAC assembly, eleven individual
sugarcane BACs from the sugarcane cultivar R570 BAC library [20] were sequenced to extremely high coverage (3000x). Reads were split into subsets
representing 200x–3000x coverage, with 100x increments. The subsets were assembled
with SASSY [21] (https://github.com/minillinim/SaSSY), an assembler customised for the assembly of complex repetitive BACs. Assemblies
had an average N50 of 52 Kb and average number of contigs per BAC of 5.2 (Additional
file 1: Table S1). For each of the BACs, assembly length increased until around 450x, then
levelled off until 900x (Fig. 1). This suggests that 450x coverage is required for optimal BAC assembly, consistent
with previous findings [21] in which the SASSY assembler was demonstrated to require a relatively large amount
of data. The variation in assembly length observed for datasets greater than 900x
(Fig. 1) is likely to be due to the increase in number of erroneous reads confounding the
assembly process.
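
The subsetting scheme described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the BAC length (120 Kb) and read length (100 bp) are hypothetical values chosen only to show how a coverage target translates into a number of reads, and the function names are invented for this example.

```python
import math

def coverage_levels(start=200, stop=3000, step=100):
    """Coverage targets from start to stop (inclusive) in fixed increments."""
    return list(range(start, stop + 1, step))

def reads_for_coverage(coverage, bac_length_bp, read_length_bp):
    """Smallest number of reads whose total bases reach the coverage target."""
    return math.ceil(coverage * bac_length_bp / read_length_bp)

levels = coverage_levels()

# Hypothetical BAC length and read length, for illustration only.
reads_needed = {c: reads_for_coverage(c, 120_000, 100) for c in levels}
```

Under these assumed values, a 500x subset of a 120 Kb BAC would need 600,000 reads of 100 bp.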

Fig. 1. Optimal coverage for assembly. Assembly sizes vs coverage for each of the 11 sugarcane
BACs. Assembly sizes peak at 450x and level off despite increase in coverage beyond
1500x

Assessing the accuracy of BAC pooled assemblies

Even with the high degree of indexing available with Illumina DNA sequencing methods,
the sequencing of individual BACs remains expensive. A pooling strategy was consequently
established to increase throughput and reduce costs. The number of BACs which can
be sequenced in a single lane of Illumina HiSeq 2000 is determined by the coverage
required (450x–900x), the mean BAC length (around 120 Kb) and the data volume from
the Illumina HiSeq (around 40 Gbp per lane). This suggests that pooling 384 BACs within
a single lane, with accurate quantification and normalisation, should produce around
850x coverage for each BAC. Because BAC DNA is likely to contain some contamination
with Escherichia coli genomic sequence, the actual sequence coverage is likely to be lower than this,
fitting well within the range of 450x–900x shown to produce optimal assemblies.
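
The coverage estimate above follows from simple arithmetic, sketched below using the figures quoted in the text (40 Gbp per lane, 384 BACs, 120 Kb mean BAC length). The `usable_fraction` parameter is a hypothetical addition that stands in for reads lost to E. coli and vector contamination; no value for it is given here.

```python
def expected_coverage(lane_yield_bp, n_bacs, mean_bac_length_bp, usable_fraction=1.0):
    """Mean per-BAC coverage when a lane is shared evenly between BACs."""
    return lane_yield_bp * usable_fraction / (n_bacs * mean_bac_length_bp)

def max_bacs_per_lane(lane_yield_bp, target_coverage, mean_bac_length_bp):
    """Largest number of BACs that still reach the target coverage."""
    return int(lane_yield_bp // (target_coverage * mean_bac_length_bp))

# Figures quoted in the text: 40 Gbp per lane, 384 BACs, 120 Kb per BAC.
cov = expected_coverage(40e9, 384, 120_000)

# Number of BACs that fit in one lane at the lower coverage bound (450x).
bacs_at_450x = max_bacs_per_lane(40e9, 450, 120_000)
```

With perfectly even pooling this gives roughly 868x per BAC, in line with the approximate figure of 850x, and any loss to contaminating sequence lowers it proportionally.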

To assess the accuracy of assembling bread wheat BACs in pools, single BACs were assembled
and compared to the same BACs assembled as pools. Seven non-overlapping bread wheat
BACs from chromosome 7DS were sequenced resulting in a sequence coverage range of
709x–1041x and a mean of 844x (Additional file 1: Table S2). After E. coli and vector sequences were filtered, sequence coverage ranged from 519x to 773x with
a mean of 658x. Assemblies of the seven individual BACs (A, B, C, E, F, G and H) had
an average N50 of 78 Kb with an average of four contigs per BAC (Table 1). Two BACs, B and G, each assembled as a single contig. Assemblies of pooled BACs (ABCE,
BCEF, CEFG and EFGH) (Table 1) had an average N50 of 41 Kb with an average of 5.3 contigs per BAC.
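
For readers unfamiliar with the N50 statistic reported here, the sketch below shows the standard definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths in the example are made up and do not come from Table 1.

```python
def n50(contig_lengths):
    """Contig length at which the cumulative sum reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Made-up contig lengths in Kb, for illustration only.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

In this example the total is 290 Kb, and the two longest contigs (80 and 70 Kb) together pass the halfway point of 145 Kb, so the N50 is 70 Kb.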

Table 1. Assembly statistics of seven single bread wheat BACs and simulated BAC pool assemblies

A sequence comparison of contigs from individually assembled BACs (Additional file
1: Table S3) showed the integrity of individual BAC assemblies in pooled assemblies
was maintained and assemblies of BAC pools remained collinear with those of individual
BAC assemblies (Fig. 2). Pooled assemblies were further validated by comparison with their Sanger sequenced
BAC ends. Mappings of BAC ends showed individual BACs remained separate in a pooled
assembly (Fig. 3).

Fig. 2. Mummer plot of assemblies of single BACs A, B, C, E against pooled BACs of ABCE

Fig. 3. BES mappings on contigs of simulated pool (ABCE). Clones A, C and E have forward (M13_For)
and reverse (SP6_Rev) BES (A01_M13_For, A01_SP6_Rev, C01_M13_For, C01_SP6_Rev, E01_M13_For,
E01_SP6_Rev) respectively correctly mapped. Clone B had no BES available but 120 bp
sequences from cloning vector ends (FOR and REV) were used to identify contig ends
of clone B

High throughput wheat BAC assembly

Following an assessment of the sequencing depth and pooling strategy, 96 BAC pools,
each representing four randomly selected BACs from a bread wheat 7DS BAC library [22], were indexed and sequenced using a single lane of Illumina HiSeq 2000. E. coli and vector sequences were removed, resulting in a mean coverage per BAC of 690x with
a range of between 184x and 889x. Only 3 % (12/384) of the BACs had coverage below
490x. Data from BAC pools were assembled using SASSY. The resulting assemblies (Table 2) had a mean N50 of 80 Kb, with an average of 2.7 contigs per BAC (Fig. 4). The average of 2.7 contigs per BAC for the 96 pools was lower than the 5.3 contigs per BAC
for the four BAC pools (ABCE, BCEF, CEFG and EFGH) (Table 1), and is likely to be more accurate as a result of the higher number of BACs assembled.
Of all the BACs, 99.5 % (382/384) had seven contigs or fewer per BAC, while 75 % of
the BACs (288/384) had three contigs or fewer per BAC (Fig. 4; Additional file 1: Table S3). Assemblies were further improved by scaffolding with mate pair (MP) reads.
Scaffolding resulted in an increase in N50 from 80 to 106 Kb. The average number of
contigs per BAC after scaffolding was reduced from 2.7 to 1.5 (Fig. 4). After scaffolding, 99.5 % (382/384) of the BACs had four scaffolds or fewer per
BAC, while 75 % of the BACs (288/384) had two scaffolds or fewer per BAC (Fig. 4; Additional file 1: Table S4).

Table 2. Mate pair mapping orientations on E. coli, contigs and scaffolds

Fig. 4. Distribution of the number of contigs and scaffolds per BAC for 96 BAC pools

Paired read orientations and insert sizes of MP reads mapped to E. coli and the 96 pool assemblies showed that 99 % of the MP reads mapped with the expected MP
orientation (RF) and expected insert size of 6 Kb (Fig. 5; Table 2). Scaffolds of the 96 pools had 97 % of the MP reads mapping with the expected MP
orientation (RF) and expected insert size of 6 Kb. Shadow library (FR) and chimeric (FF and RR) MP
mapping orientations together accounted for 3 % (Table 2) on E. coli, the 96 pool assemblies and scaffolds. This was within the expected values for Illumina
Nextera MP libraries of ~2 % [23]. This suggests the contiguity of the assemblies is accurate.
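
The orientation classes used above can be illustrated with a short sketch. It assumes each mapped pair is described by the leftmost mapping position and strand of each read, and it labels same-strand pairs FF when both reads are on the forward strand and RR when both are on the reverse strand. These input conventions and the function names are assumptions made for illustration and are not taken from the mapping pipeline used here.

```python
from collections import Counter

def classify_pair(pos1, strand1, pos2, strand2):
    """Label a mapped read pair as RF, FR, FF or RR.

    Strands are "+" or "-". For opposite-strand pairs the label is read
    left to right along the reference, so RF means the leftmost read maps
    to the reverse strand (outward facing, as expected for mate pairs).
    """
    if strand1 == strand2:
        return "FF" if strand1 == "+" else "RR"
    left_strand = strand1 if pos1 <= pos2 else strand2
    return "FR" if left_strand == "+" else "RF"

def orientation_fractions(pairs):
    """Fraction of pairs in each orientation class."""
    counts = Counter(classify_pair(*pair) for pair in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read pairs supplied")
    return {label: counts[label] / total for label in ("RF", "FR", "FF", "RR")}

# Made-up pairs: (pos1, strand1, pos2, strand2)
example_pairs = [
    (1000, "-", 7000, "+"),  # outward facing, about 6 Kb apart: RF
    (7000, "+", 1000, "-"),  # same layout with the reads swapped: RF
    (2000, "+", 2300, "-"),  # inward facing, short insert: FR
    (500, "+", 900, "+"),    # both on the forward strand: FF
]
fractions = orientation_fractions(example_pairs)
```

Applied to real mappings, the RF fraction corresponds to the correctly orientated MP reads, FR to the shadow library, and FF plus RR to the chimeric class.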

Fig. 5. Distribution and orientation of MP insert sizes on E. coli (a), contigs (b) and scaffolds (c) of 96 wheat BAC pools. Y axis: MP read counts (log scale); X axis: insert sizes. Correctly orientated MP reads (orientation RF) are
shown in green, shadow library MP reads (orientation FR) are shown in orange
and chimeric MP reads (orientations FF and RR) are
shown in blue

A comparison of assembly sizes of the 96 pooled BACs to that of the sum of their corresponding
individual BAC sizes estimated by fingerprinted contigs (FPC) software (Table 2) showed the average assembly size for a pool of four BACs was 441 Kb while the average
predicted FPC size was 440 Kb. It is expected that assembly size and FPC size estimates
would not be equal as repeats influence assembly sizes while FPC size estimates are
approximations derived from the number of visualized restriction fragments. Despite
this, paired t tests showed there was no significant difference between the FPC sizes
and assembly sizes of the 96 pooled BACs (t = −0.14, df = 95, p = 0.8870).
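
The paired t statistic reported above can be computed as in the sketch below, which uses only the Python standard library. The four pairs of sizes are invented for illustration. With the 96 pools analysed here the degrees of freedom would be 96 − 1 = 95, as reported. Obtaining the p value additionally requires the t distribution, for example via `scipy.stats.ttest_rel`, which is not used in this sketch.

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1

# Made-up sizes in Kb for four pools, for illustration only.
fpc_sizes = [440, 452, 430, 445]
assembly_sizes = [441, 450, 433, 444]
t_stat, df = paired_t(fpc_sizes, assembly_sizes)
```

A t statistic close to zero, as in this toy example, indicates that the mean difference between paired sizes is small relative to its standard error.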

While previous studies in barley recommended the use of read lengths 600 bp sequenced
by Roche/454 [24], no current studies have demonstrated accurate, robust pooled BAC assemblies using
Illumina short reads in wheat. Our results show accurate assemblies of highly repetitive
and complex genomes can be achieved using Illumina short reads with 3 % chimeric
assemblies compared to previous estimates of 24–47 % using Roche/454 [24].