The next generation of target capture technologies

Next-Generation Sequencing (NGS) technology has transformed the field of genetics, enabling large-scale, high-throughput genetic studies for a variety of research and diagnostic applications. While economically sequencing entire genomes remains an important goal of NGS, many research and diagnostic applications are best served by targeted DNA sequencing of specific genomic loci. Targeted DNA sequencing is advantageous not only because it is more cost-effective, facilitating higher sample throughput than whole genome sequencing, but also because it improves accuracy by optimizing read depth coverage and reducing the complexity of the DNA to be sequenced.
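
The relationship between target size and read depth can be illustrated with a short calculation. The sketch below is not taken from this study: it uses the standard approximation that mean depth equals the total number of sequenced bases divided by the size of the target, and it assumes, as an idealization, that every read maps to the target. The read count, read length, and genome size are hypothetical round numbers chosen only for illustration.

```python
def mean_depth(n_reads: int, read_length: int, target_size: int) -> float:
    """Expected mean read depth: total sequenced bases divided by target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return n_reads * read_length / target_size


# Hypothetical sequencing run (illustrative values, not from the study).
N_READS = 10_000_000
READ_LENGTH = 150

# Approximate human genome size versus an illustrative 4 Mbp target region.
whole_genome = mean_depth(N_READS, READ_LENGTH, 3_100_000_000)
targeted = mean_depth(N_READS, READ_LENGTH, 4_000_000)

print(f"whole genome: {whole_genome:.2f}x, targeted: {targeted:.0f}x")
```

Under these assumptions, the same sequencing output that covers a whole genome at less than 1x would cover a 4 Mbp target several hundred-fold. In practice the gain is smaller, because a fraction of reads always falls outside the target.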

Several methods have been developed for the targeted enrichment of genomic DNA [1–4] for a variety of clinical and research applications [5–11]. They are typically based upon a multiplexed PCR amplification reaction [12], DNA hybridization to a capture oligonucleotide (either on an array or in solution) [13–15] or DNA capture via molecular inversion probe circularization [16, 17]. Regardless of the approach, all of these DNA enrichment methods rely heavily on fragmentation of genomic DNA prior to amplification, resulting in relatively short sequencing templates (less than 1000 base pairs). As a result, existing methods for genomic partitioning remain a severely limiting factor for comprehensively characterizing complex genomic loci: they cannot provide the larger fragments required to span confounding sequence elements, such as extended repeats, or to resolve sections of unknown or unexpected sequence that have been inserted or rearranged within the targeted region [18, 19].

Importantly, such large DNA templates can now be utilized by the newer, “third generation” sequencing platforms, which are capable of producing significantly longer reads [20–22] and of sequencing through traditionally difficult templates with high GC content [23]. The longer read lengths produced by these platforms have been shown to be highly advantageous for characterizing structural variants, haplotype phasing within complex genomic loci and de novo genome assembly [22, 24–26].

Our DNA enrichment method, Region Specific Extraction (RSE), addresses this unmet need by capturing long DNA fragments of approximately 20 kbp in length. RSE uses a single primer extension step for capture, in which standard oligonucleotides (approximately 20 bases in length) hybridize to highly specific sequence motifs within the targeted region(s) and are enzymatically extended to incorporate biotinylated nucleotides into the nascent DNA strand. The targeted genomic DNA segments are then pulled down using streptavidin-coated magnetic particles, which bind to the newly synthesized biotinylated DNA sections. These biotinylated portions represent a small percentage of the overall extracted DNA and do not impair the efficiency of library preparation and sequencing. The captured segments of the original genomic DNA template, which extend far in both directions from any single point where a capture primer has hybridized, are then typically amplified by whole genome amplification and processed by standard NGS protocols (Fig. 1).

https://static-content.springer.com/image/art%3A10.1186%2Fs12864-016-2836-6/MediaObjects/12864_2016_2836_Fig1_HTML.gif
Fig. 1

Principle of RSE. (a) During the first step of RSE, the genomic template DNA (light blue) is briefly denatured to allow capture primers (red) to hybridize. (b) The bound primers are enzymatically extended with biotinylated nucleotides. The extended portions of the primers, shown in green, form the “handle” to which streptavidin-coated magnetic beads bind. During this process, many biotins of the same primer/target DNA complex are bound to streptavidin binding sites on the same bead, thereby forming a topological linkage that firmly locks even very long DNA segments, extending in both directions from the capture point, onto the surface of the magnetic bead. The primer/target DNA complex is then magnetically purified and released from the bead surface by heat. (The drawing is not to scale: the magnetic beads are approximately an order of magnitude larger than illustrated here.)

A program we developed for primer design (Antholigo; see “Methods”) can be instructed to position the primers at variable distances from their nearest neighbors. If desired, this distance can be 8-10 kbp or greater in order to minimize the number of primers used while still providing optimal coverage of the targeted region. RSE is simple to use and, unlike other enrichment technologies, requires no fragmentation of the genomic sample prior to capture. Although the typical size of captured fragments in this study was about 20 kbp, the same principle has been used to extract significantly larger segments, depending on DNA quality and the method used for its extraction [27].
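
The effect of primer spacing on the number of primers can be sketched as follows. This is a simplified illustration under stated assumptions and is not the Antholigo algorithm: it places candidate capture points at even intervals across a region so that neighboring points are never farther apart than a chosen maximum, and it ignores the sequence-specificity criteria that actual primer design must satisfy. The function name `primer_sites` and the example coordinates are hypothetical.

```python
import math


def primer_sites(region_start: int, region_end: int, max_spacing: int) -> list[int]:
    """Evenly spaced candidate capture points covering [region_start, region_end].

    Neighboring points are at most max_spacing apart. Sequence content is
    ignored; this is a spacing sketch only, not a primer design tool.
    """
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    if max_spacing <= 0:
        raise ValueError("max_spacing must be positive")
    length = region_end - region_start
    # Smallest number of intervals that keeps every gap within max_spacing.
    n_intervals = math.ceil(length / max_spacing)
    return [region_start + (i * length) // n_intervals for i in range(n_intervals + 1)]


# Hypothetical 100 kbp region with capture points at most 10 kbp apart.
sites = primer_sites(1, 100_001, 10_000)
gaps = [b - a for a, b in zip(sites, sites[1:])]
print(len(sites), max(gaps))
```

Because captured fragments extend in both directions from each capture point, a spacing below the typical fragment length leaves neighboring captured segments overlapping, which is why wide spacing can still cover the region.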

Here we demonstrate the utility of RSE for the targeted sequencing of the most complex region of the human genome, the major histocompatibility complex (MHC; HG19 coordinates chr6:29618227-33618227) on the short arm of chromosome 6. The MHC is known to be the most gene-dense region of the human genome, with many transcribed genes playing an important role in innate and adaptive immune processes [28]. Consequently, numerous loci throughout the MHC have been associated with immune-mediated diseases [29–32]. The MHC contains dozens of highly polymorphic genes and large regions of duplication and repetitive elements [28]. Interestingly, despite its significance, there are only two completely characterized MHC haplotypes, derived from two homozygous B cell lines, namely PGF (the reference sequence for the MHC in the reference human genome) and COX [33–36]. The same region of the MHC in the same cell line, PGF, has been targeted by other capture technologies [15], which offers a unique opportunity for comparisons that demonstrate the advantages of RSE. Ultimately, this technology can contribute greatly to the comprehensive characterization of such difficult regions across the genome by providing both accurate sequencing and a description of structural variations, including deletions, insertions and duplications.
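
For orientation, the size of the targeted interval follows directly from the coordinates given above. The snippet below is a minimal sketch, not part of the study's pipeline: the helper name `parse_region` is hypothetical, and the span is reported as the simple difference between the two endpoints (under a 1-based inclusive convention the interval would contain one additional base).

```python
import re


def parse_region(region: str) -> tuple[str, int, int]:
    """Split a 'chrN:start-end' string into chromosome, start and end."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognized region string: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError("end coordinate precedes start coordinate")
    return chrom, start, end


# MHC target region as stated in the text (HG19 coordinates).
chrom, start, end = parse_region("chr6:29618227-33618227")
span = end - start  # distance between the two endpoints, in base pairs

print(f"{chrom}: {span:,} bp ({span / 1e6:.1f} Mbp)")
```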