Using expected sequence features to improve basecalling accuracy of amplicon pyrosequencing data

Combining Multipass with models for protein coding DNA

Three Plasmodium falciparum laboratory reference strains were resequenced multiple times using bi-directional 454 FLX Titanium amplicon sequencing (12 samples containing one or mixtures of isolates 3D7, DD2 and HB3). Specifically, a ~370 nucleotide region (st.dev. 25 nt, min. 319 nt, max. 468 nt) in the hyper-variable var gene family was targeted, encoding the malaria antigen PfEMP1, of which each parasite has 50-60 variants. The flowgram data was demultiplexed, MID-tags and primer sequences were removed, and reverse flowgrams were reverse complemented as described in the Methods section. The resulting median sample coverage was 2956 reads (Additional file 1: Table S1). Flowgrams from each sample were then subjected to different basecalling methods as described below, and the resulting nucleotide sequences were subjected to denoising and chimera removal to eliminate PCR artifacts. This resulted in a total yield from the 12 samples of between 548 and 599 sequence variants, with varying degrees of accuracy compared to the known reference database sequences (Fig. 1).

Fig. 1

Accuracy of Plasmodium falciparum reference strain amplicon resequencing using different basecalling methods. Shown for each basecalling method is the fraction of all sequences (provided in the legend as N) with a given number of errors. See Additional file 1: Figure S3 with log-scale for detailed frequencies of sequences with multiple errors

Direct 454 basecalling of each read into nucleotides, by deriving homopolymer lengths from flow values rounded to nearest integer, and subsequent denoising, resulted in a set of var sequences where 37.0 % matched the known database sequences perfectly (Fig. 1) while sequences in average had 1.49 errors (st.dev. 1.99, max. 14) (Additional file 1: Figure S3). In this approach, reads originating from the same template sequence was determined only by clustering after flowgram basecalling, and this yielded 548 sequence variants.

The current state of the art method AmpliconNoise clusters and aligns the flowgrams before calculating the most likely nucleotide sequence from each alignment. Applying AmpliconNoise to the combined forward and reverse flowgrams nearly doubled the number of error-free sequences to 68.8 % (Fig. 1), and a mean error count of 0.935 (st.dev. 2.04, max. 13) was observed (Additional file 1: Figure S3). Flowgram clustering increased the yield to 597 variants, most likely because the more correct sequences gave larger clusters during denoising, leading to a higher number passing the cluster size threshold of 3 reads.

Multipass was then tested, using the initial flowgram clustering of AmpliconNoise, but with novel implementation of flowgram alignment and calculation of the N most likely nucleotide sequences from each alignment. Multipass uses updated probability distributions for flow signals originating from homopolymers with length 5 nucleotides (see Methods). Hence a considerable improvement in sequence accuracy was observed for this dataset by simply selecting the most likely nucleotide sequence given by Multipass, increasing the fraction of perfectly basecalled sequences to 81.8 % (Fig. 1) and reducing the mean number of errors per sequence to 0.258 (st.dev. 0.644, max. 5) (Additional file 1: Figure S3).

Using Multipass to calculate the most likely nucleotide sequences from each flowgram alignment revealed that, especially for low coverage sequences, a considerable fraction of correct basecalls were ranked not as the most likely, but in the top ten most likely sequences (Fig. 2). Therefore, Multipass was set to calculate the N?=?10 most likely sequences, and in order to give the correctly basecalled sequences more support, expected sequence features were employed. So for each of the 10 most likely sequences, the likelihood of the sequence being correct given the flowgrams P(CBS|flows) was combined with the likelihood of the sequence being correct given a sequence feature P(CBS|feature), to give P(CBS|flows,feature) by which the most likely correct basecall was chosen. Any sequence feature enabling discrimination between correct and incorrect basecalls could be utilized.

Fig. 2

Ranking of the correct basecall according to P(CBS|flows). Upon flowgram clustering and alignment, Multipass was employed to calculate the fifty most likely basecalls given each flowgram alignment. The likelihood ranking of the correct basecall is shown against the number of flowgrams in the alignment. Maximally two hundred flowgrams from each cluster were used for alignment. The marker size is proportional to the abundance at a point

The first feature used in the model was the presence or absence of a full-length open reading frame, where the probability of having a FRF in a correctly basecalled sequence P(FRF|CBS) was determined by the frequency of FRFs in ~6800 DBL?-tag sequences obtained by whole genome Illumina sequencing [12]. Out of the 10 sequences provided by Multipass, the one with highest likelihood given flowdata and presence of full-length open reading frame P(CBS|flows,FRF) was selected. This approach raised the basecalling accuracy to 91.2 % correct sequences (Fig. 1, RF), and a mean of 0.189 errors per sequence (st.dev. 0.671, max. 5). Since the Illumina sequences used to establish P(CBS|FRF) may contain errors disrupting the full length reading frame, the true P(FRF|CBS) could be higher than the one used above. Therefore we tried an arbitrary high P(FRF|CBS)?=?1-1e-150, which gave an increase to 92.0 % correct sequences (Fig. 1, RFhip), however the mean number of errors per sequence also increased to 0.192 (st.dev. 0.700, max. 5), indicating that this model aggravated the condition of erroneous sequences (Additional file 1: Figure S3).

The second feature tried in the model was the match to a profile hidden Markov model (HMM) of the expected amino acid sequence, generated from DBL?-tags obtained from a small global selection of field isolates. This feature should be even more sensitive to frameshifts than FRF, since the HMM score is lowered even if the full-length open reading frame is retained upon introduction of an indel. The likelihood of a sequence being correctly basecalled given an HMM match score of a certain magnitude P(CBS|S
HMM
) was determined from score distributions (Additional file 1: Figure S2) as described in the Methods section. By selecting the most likely sequence given flowgrams and HMM match P(CBS|flows,S
HMM
), the accuracy reached 91.4 % correct sequences (Fig. 1), and a mean of 0.189 errors per sequence (st.dev. 0.673, max. 5).