Clinical evaluation of panel testing by next-generation sequencing (NGS) for gene mutations in myeloid neoplasms

Patient samples

Fifty patient samples were accrued from April 2011 to November 2014. They comprised
19 males and 31 females at a median age of 60 years (range 18–88 years). The diagnoses
were M0?=?1, M1?=?7, M2?=?12, M3?=?1, M4?=?5, M5?=?3, M6?=?2, not otherwise specified
(NOS)?=?4, AML with myelodysplasia (MDS) related changes (AML-TMDS)?=?5, AML transformed
from MDS or MDS/MPD?=?3, therapy-related AML?=?1, refractory AML?=?2, high-grade MDS?=?3
and atypical chronic myeloid leukemia?=?1. A diagnosis of AML-NOS was rendered in
four patients due to diagnosis on peripheral blood only (n?=?2), aparticulate bone
marrow aspirate (n?=?1) and diagnosis at another institution (n?=?1). The study was approved by the Research Ethics Committee of Hong Kong Sanatorium
Hospital (reference number: REC-2015-02).

DNA was extracted from the corresponding specimens (bone marrow?=?42 and peripheral
blood?=?8) using QIAamp DNA Blood Mini Kit (Qiagen, Germany). Four additional samples
were available from 3 patients for comparison (buccal swab and second peripheral blood
samples of patient 31, post-treatment sample of patient 35 and relapsed sample of
patient 36). Extracted DNA was quantified using Qubit dsDNA BR Assay Kit and Qubit
2.0 Fluorometer (Life Technologies, USA).

Any known mutation status of FLT3, NPM1, KIT, CALR, MPL, CSF3R by Sanger sequencing and JAK2 V617F by allele-specific polymerase chain reaction (PCR) and PCR-restriction fragment
length polymorphism (PCR-RFLP) analysis of the patient samples were curated from the
clinical record for comparison with NGS results.

Control samples

Archival DNA samples (n?=?11) with known mutations in CALR (n?=?6), JAK2 (n?=?2), KIT (n?=?1), MYD88 (n?=?1) and TP53 (n?=?1) were retrieved as positive controls to validate the NGS myeloid panel. DNA extracted
from peripheral blood samples of healthy adults with normal complete blood profile
(n?=?18) were accrued in July and October 2014 as negative controls. These control samples
were analyzed in the same way as the patient samples.

NGS myeloid gene panel

The myeloid gene panel targets 54 genes (full coding exons of 15 genes: BCOR, BCORL1, CDKN2A, CEBPA, CUX1, DNMT3A, ETV6/TEL, EZH2, KDM6A, IKZF1, PHF6, RAD21, RUNX1/AML1, STAG2 and ZRSR2, and exonic hotspots of 39 genes: ABL1, ASXL1, ATRX, BRAF, CALR, CBL, CBLB, CBLC, CSF3R, FBXW7, FLT3, GATA1, GATA2, GNAS, HRAS, IDH1, IDH2, JAK2, JAK3, KIT, KRAS, KMT2A/MLL, MPL, MYD88, NOTCH1, NPM1, NRAS, PDGFRA, PTEN, PTPN11, SETBP1, SF3B1, SMC1A, SMC3, SRSF2, TET2, TP53, U2AF1 and WT1) by 568 amplicons (length range: 225–275 bp). The combined coverage was 141 kb. Amplicon
sequencing libraries were prepared from 50 ng of DNA per sample using TruSight myeloid
sequencing panel (Illumina, USA). A highly multiplexed pool of oligonucleotide pairs
upstream and downstream to each region of interest (ROI) was employed. Each oligonucleotide
contained unique target-specific sequences and universal adaptor sequence used in
the subsequent amplification reaction. For each sample, an extension-ligation reaction
extended across the ROI and followed by ligation to unite the two probes to yield
a library of templates with common ends. This library of new templates was PCR amplified
with a unique pair of indexes incorporated for downstream sequence-based sample identification.
After PCR clean-up, double-stranded DNA length and quantity of individual libraries
were assessed by DNA 1000 kit and 2100 Bioanalyzer system (Agilent, USA). Libraries
were normalized according to the measured quantity and pooled in batches (8 to 24
libraries per pool). Paired-end sequencing runs were performed on a MiSeq (Illumina,
USA) with reagent kit v3 according to manufacturer’s instructions.

Variant calling and annotation

Paired sequences obtained from each sample were mapped to human genome reference GRCh37/hg19
using BWA-MEM 20] version 0.7.7 with default parameters. Three variant callers were used: (1) Samtools
21] version 0.1.19, with mpileup command parameters -L 100000 -d 100000 to cater for
amplicons with depth exceeding 250-fold and bcftools command parameter-m0.99 to use
the new insertion-deletion (INDEL) calling model; (2) GATK HaplotypeCaller 22] version 2.8-1 according to the best practices recommended by the authors; and (3)
VarScan 23] version 2.3.7 somatic mutation calling mode based on one of the negative controls
or the matched germline DNA if available. For detection of FLT3 internal tandem duplication (ITD), additional variant callers were used specifically
for the region chr13:28607161–28609590: (1) Pindel 24] version 0.2.5a7 with the insert size configured as the summation of forward and reverse
sequencing read length, to adapt the algorithm to the amplicon sequencing reads in
this study, and (2) a novel algorithm ITDseek developed in this study (details described
in a separate section).

Variant calls were first annotated by Ensembl Variant Effect 25] Predictor version 75 and then manually examined by at least two individuals. Sequence
alignment of selected variants was manually examined with Integrative Genomics Viewer
26] (IGV). ROI sequencing depth was summarized using BEDTools 27] version 2.19.1 Minimum reportable variant allele frequency (AF) is 10 % of sequencing
depth at least 500-fold. ROI with depth less than 500-fold were regarded as sub-optimal
regions. Variants found to be reported in COSMIC database version 67 and/or dbSNP
version 138 were prioritized for manual examination while those reported in 1000 Genomes
project (phase 1) were excluded. FLT3 ITD and ASXL1 c.1934dupG mutations were confirmed in selected patients using PCR fragment analysis
by capillary electrophoresis (primer sequences available upon request) and analysis
software Peak Scanner version 1.0 (http://www.appliedbiosystems.com). Variants were described according to the recommendations of Human Genome Variation
Society (HGVS). Variant descriptions were checked by Mutalyzer Name Checker (http://mutalyzer.nl).

Evaluation of FLT3 ITD detection algorithms and development of a novel detection algorithm ITDseek

FLT3 ITD detection performance of Pindel, Samtools, GATK HaplotypeCaller were compared based
on a simulated dataset of ITD mutations. An in silico FLT3 ITD sequencing read simulator ITDsim was developed based on BioPerl 28]. The amplicon targeting chr13:28,608,112–28,608,312 (both primer binding sites excluded)
was chosen as the simulation target due to its highest reported FLT3 mutation rate in COSMIC among the three amplicons in the region. All combinations
(n?=?40,401) of ITD lengths (range: 1–201 bp; n?=?201) and starting positions (chr13:28,608,112–28,608,312; n?=?201) were simulated separately. ITD allelic burden was defined as 50 %. For each
combination, 1000 paired-end ITD reads and 1000 paired-end wild-type reads (both 2?×?275 bp)
were simulated and the FASTQ file pair was subject to variant calling as described
above. Simulation and corresponding variant calling were performed on a Cray XC30
supercomputer.

To overcome the difficulty in calling long ITD mutations with short-read NGS amplicon
sequencing, a novel FLT3 ITD detection algorithm ITDseek was developed based on the following principles.
It takes SAM/BAM alignments as input and outputs any detected ITD mutations in VCF
format. In case of a short ITD (e.g., 9 bp) present in the middle of raw sequencing
reads (e.g., 250 bp), the wild-type sequences upstream and downstream of ITD mutation
are long enough for proper alignment to the FLT3 locus, with an insertion in between representing the additional sequence (operation
“I” in CIGAR field of SAM alignment output by BWA-MEM). General-purpose variant callers
like Samtools, GATK HaplotypeCaller and VarScan could then readily identify the inserted
nucleotides. In contrast, if a given ITD is too long and/or too close to either end
of amplicon, the sequence downstream of ITD mutation will become too short or even
completely absent in the raw sequences obtained. Without long enough downstream sequences,
the additional duplicated nucleotides are marked as soft-clipped instead (operation
“S” in CIGAR field). These soft-clipped nucleotides are usually ignored by general-purpose
variant callers as if they are sequencing adapter. In principle, realignment of soft-clipped
nucleotides is a way to identify possible ITD mutations. Since BWA-MEM outputs a separate
secondary alignment representing those soft-clipped nucleotides, such realignment
is effectively performed. However, secondary alignments are usually ignored by general-purpose
variant callers. ITDseek specifically searches for any soft-clipping points in primary
alignments and correlate them with any corresponding secondary alignments for ITD
identification. The length of ITD was extrapolated by the distance between the point
of soft-clipping and the beginning of realignment of soft-clipped bases. The analysis
is performed in individual reads and directions separately to identify any rare ITD
clone (as rare as a single read only) or multiple ITD clones. In case of a special
FLT3 ITD type that the additional sequence is entirely insertion of unknown origin, there
is no secondary alignment. ITDseek identifies such ITD by comparing points of soft-clipping
in sequencing read pairs. ITDseek was also evaluated based on the same simulated dataset
described before. Source codes and documentation of ITDsim and ITDseek are available
at http://github.com/tommyau/ITDseek for non-commercial use.