Technological considerations for genome-guided diagnosis and management of cancer


Just as selection of the “test” is dictated by intended use, the choice of sequencing technology (or platform) is also an important consideration. Although the sequencing landscape is less diverse today, with Illumina (San Diego, CA, USA) capturing most of the application space, the complexity, scale, cost, and required throughput of the test remain important factors in determining the optimal platform.

The required read length and the need for paired-end reads are primary considerations. Read length is an important factor that relates to the type of genomic alteration events that might be queried and the overall accuracy of the placement of sequence reads relative to the target. In general, the most commonly used massively parallel sequencing platforms today generate short reads of a few hundred bases. These include the Illumina platforms (MiniSeq 2 × 150 bases, MiSeq 2 × 300 bases, NextSeq 2 × 150 bases, and HiSeq series 2 × 150 bases), the Thermo (Waltham, MA, USA) Ion Torrent platform (Proton 1 × 200 bases), and the Qiagen (Hilden, Germany) GeneReader (100 bases). The utility of reads of this length depends on the type of assay being performed. For example, for amplicon sequencing (using “hotspot” panels), short read lengths generally match the size of the amplicon, and the amplicons can be designed such that the hotspot itself is located at a position where high quality can be expected (that is, not at the end of a read). Reads of a hundred or so bases are also useful for short variant detection using targeted sequencing of a gene panel or exome, or in WGS. Similarly, for FFPE or cfDNA-derived materials, template lengths are generally shorter, so read lengths in the low hundreds of bases are appropriate.
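The amplicon design point above can be illustrated with a short check of where a hotspot base falls within each read. This is a minimal sketch, not a published design tool: the function names, the 20-cycle `end_margin`, and the simplification that reads start at the amplicon ends (primer sequences ignored) are all assumptions made for illustration.

```python
def hotspot_cycles(amplicon_len, hotspot_offset, read_len, paired=True):
    """Return the 1-based sequencing cycle at which each read covers the hotspot.

    hotspot_offset is the 0-based position of the hotspot within the amplicon.
    A value of None means that read is too short to reach the hotspot.
    Assumption: reads begin at the amplicon ends (primers are ignored).
    """
    forward = hotspot_offset + 1             # counted from the amplicon start
    reverse = amplicon_len - hotspot_offset  # counted from the amplicon end
    cycles = {"forward": forward if forward <= read_len else None}
    if paired:
        cycles["reverse"] = reverse if reverse <= read_len else None
    return cycles


def well_placed(amplicon_len, hotspot_offset, read_len, end_margin=20, paired=True):
    """True if at least one read covers the hotspot outside its last
    `end_margin` cycles. The margin is a hypothetical threshold, not a
    platform specification."""
    cycles = hotspot_cycles(amplicon_len, hotspot_offset, read_len, paired)
    return any(c is not None and c <= read_len - end_margin
               for c in cycles.values())
```

Under these assumptions, a hotspot at offset 145 of a 150-base amplicon falls in the last cycles of a single 150-base read, but is read at cycle 5 of the reverse read when the amplicon is sequenced from both ends.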

Paired-end sequencing, which refers to sequencing a DNA fragment from both ends (the forward and reverse reads may or may not overlap), increases the utility of short reads in two ways. First, some types of structural variation can be detected when the pairs of reads align to the genome in an unexpected way [61]. Second, sequencing both ends of fragments allows “de-duplication” in deep sequencing: fragments with exactly the same ends are flagged as molecular duplicates, which do not add to library complexity, and the redundant reads are masked (for example, with the MarkDuplicates tool in Picard [62]).
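Both uses of paired-end reads can be sketched in a few lines. The example below is a simplified illustration and is not the algorithm implemented by Picard MarkDuplicates: the `Pair` record, the 1000-base `max_insert` threshold, and the assumption that each position is the 5' alignment coordinate of a read are all hypothetical choices made for this sketch.

```python
from collections import namedtuple

# Hypothetical minimal record for an aligned read pair.
# pos1/pos2 are assumed to be the 5' alignment coordinates of each read.
Pair = namedtuple("Pair", "name chrom1 pos1 strand1 chrom2 pos2 strand2")


def is_discordant(pair, max_insert=1000):
    """Flag pairs that align in an unexpected way.

    A pair is treated as concordant only if both reads map to the same
    chromosome, lie within max_insert bases of each other, and point
    toward each other (leftmost read on '+', rightmost on '-').
    """
    if pair.chrom1 != pair.chrom2:
        return True
    if abs(pair.pos2 - pair.pos1) > max_insert:
        return True
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    return not (left[1] == "+" and right[1] == "-")


def mark_duplicates(pairs):
    """Return names of pairs whose two fragment ends match an earlier pair.

    The key is the unordered set of both ends, so a fragment is recognized
    regardless of which read was reported first.
    """
    seen = set()
    duplicates = []
    for p in pairs:
        end_a = (p.chrom1, p.pos1, p.strand1)
        end_b = (p.chrom2, p.pos2, p.strand2)
        key = tuple(sorted([end_a, end_b]))
        if key in seen:
            duplicates.append(p.name)
        else:
            seen.add(key)
    return duplicates
```

In this sketch the first pair seen for a given set of ends is kept and later ones are flagged; production tools typically choose which copy to retain using additional criteria such as base quality.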

The main limitation of short reads (even if paired end) is in the discovery of fusion events or structural variation. Detection of known fusion events can be enabled by targeted assays that increase the utility of short reads by requiring mapping to a small or predefined event. Alternatively, specialized library construction methods to create long insert mate-paired libraries have shown some success in structural variation detection [63]. For discovery of novel rearrangements, the most powerful approach involves long reads in which fusion or rearrangement events are spanned within the read. Options here include Pacific Biosciences (Menlo Park, CA, USA) instruments that generate reads of thousands of bases, or approaches such as the 10X Genomics platform, which links together short reads using molecular barcoding. Another platform under active development in the long read space is the nanopore-based sequencing technology commercialized by Oxford Nanopore (Oxford, UK).

Ideally, the generation of very long reads would cost the same as an equal coverage of short reads, but this is not the case. The most dramatic decreases in sequencing cost have come from the platforms that generate short reads. For example, release of the Illumina HiSeqX decreased cost roughly threefold compared to the HiSeq2500: sequencing of a 30× human genome cost approximately $1500 on the HiSeqX compared to $5000 on the HiSeq2500. Sequencing the whole genome with long reads on a platform such as PacBio is cost prohibitive in most settings, at $20,000–80,000 per sample. In general, long read sequencing is used to sequence smaller (such as microbial) genomes or to target complex regions of the human genome (such as human leukocyte antigen genes) that are intractable for short read sequencing.

Short read sequencing costs vary considerably by platform, based on the instrument yield. For example, the lowest cost per Gb (billion bases) on a short read sequencer is approximately $15/Gb on the HiSeqX platform, with an output of 1800 Gb per run. This level of throughput is appropriate for WGS, which requires at least 100 Gb of data per sample, or considerably more for tumor sequencing. Lower throughput platforms such as the MiSeq and HiSeq 2500 cost considerably more per Gb ($200/Gb and $45/Gb, respectively) but have an output per run (15 Gb for MiSeq, 1000–1500 Gb for HiSeq 2500) more appropriate for smaller scale sequencing, such as the panel test. A panel test of 100–200 genes might require 0.5–1 Gb per sample. Platform selection for this level of sequencing is a balancing act between the competing pressures of cost and turnaround time. To run most efficiently, multiple samples would be indexed, pooled, and sequenced on enough lanes to achieve the desired coverage. In practice, in the clinical testing world, the need for more rapid turnaround times necessitates running incomplete, and thus more expensive, batches. Technical features, such as template preparation techniques, sequencing chemistry, and error profiles, are also important considerations. A review of technical differentiators is presented by Goodwin et al. [64].
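The trade-off between batch size and per-sample cost can be made concrete with a small calculation. The sketch below uses the approximate per-Gb costs and run outputs quoted in the text; the function names are hypothetical, and the code assumes that the quoted cost per Gb reflects a fully used run, so that the cost of a run is fixed at cost per Gb multiplied by run output no matter how many samples are loaded.

```python
# Approximate figures quoted in the text (US dollars per Gb, Gb per run).
COST_PER_GB = {"HiSeqX": 15.0, "HiSeq2500": 45.0, "MiSeq": 200.0}
RUN_OUTPUT_GB = {"HiSeqX": 1800.0, "MiSeq": 15.0}


def data_cost(platform, gb):
    """Cost of generating `gb` of data at the platform's quoted rate."""
    return COST_PER_GB[platform] * gb


def max_samples_per_run(platform, gb_per_sample):
    """Number of samples that fit in one run at the required yield."""
    return int(RUN_OUTPUT_GB[platform] // gb_per_sample)


def cost_per_sample(platform, gb_per_sample, n_samples):
    """Per-sample cost when a whole run is shared by n_samples.

    Assumption: the full run is paid for even if the batch is incomplete.
    """
    capacity = max_samples_per_run(platform, gb_per_sample)
    if not 1 <= n_samples <= capacity:
        raise ValueError(f"n_samples must be between 1 and {capacity}")
    run_cost = COST_PER_GB[platform] * RUN_OUTPUT_GB[platform]
    return run_cost / n_samples


# WGS at 100 Gb per sample
print(data_cost("HiSeqX", 100))         # 1500.0
print(data_cost("HiSeq2500", 100))      # 4500.0

# Panel test at 1 Gb per sample on a MiSeq
print(max_samples_per_run("MiSeq", 1))  # 15
print(cost_per_sample("MiSeq", 1, 15))  # 200.0 (full batch)
print(cost_per_sample("MiSeq", 1, 5))   # 600.0 (incomplete batch)
```

Under these assumptions, the WGS figures are consistent with the approximate per-genome costs cited for the two HiSeq platforms, and running a MiSeq panel batch one-third full triples the per-sample cost, which illustrates why faster turnaround is more expensive.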