DiSNPindel: improved intra-individual SNP and InDel detection in direct amplicon sequencing of a diploid

Overview

DiSNPindel is implemented with the main sequential steps for SNP detection: ‘1. Open
a file’, ‘2. Find SNPs’, ‘3. Manual modification (optional)’ and ‘4. Save result’,
each corresponding to a button or box on the interface (Additional file 1: Figure S1). If continuous double peaks are found, the ‘Switch to Indel detection’
button can be clicked to ‘Indel detection’ interface, where ‘3. Find Indels’, ‘4.
Manual modification (optional)’ and ‘5. Save result’ were designated for the specific
functions (Additional file 1: Figure S2).

DiSNPindel is a stand-alone package programmed in Matlab R2011b and LabWindows/CVI
9.0. It runs on Windows platform and can deal with multiple traces (.ab1 and/or .scf
files), each being analyzed in an independent panel that is switchable between SNP
and InDel interfaces. The method is composed mainly of four procedures, namely, noise
filtering, feature extraction, SNP diagnosis and, if applicable, InDel diagnosis.
Figure 1 summarizes the overall structure of DiSNPindel.

Fig. 1. The overall structure of DiSNPindel. Haar wavelet transformation 11] was used to filter out the high-frequency noisy sub-signals of a base peak (wave).
Three features were extracted from the primary and secondary peaks at a base position,
namely, horizontal distance, vertical height ratio and half-wave width ratio, which
were then inputted into LM-BPNN for intra-individual SNP diagnosis. The LM-BPNN contained
three, ten and one neurons (vectors) in the input, hidden and output layers, respectively.
Intra-individual InDel diagnosis was conducted using a stepwise allelic base alignment
algorithm

Noise filtering

The Haar wavelet transformation 11] was used to decompose a peak (wave) signal into a low-frequency and a high-frequency
sub-signals. While the high-frequency sub-signal would be removed as noises, the low-frequency
sub-signal was subjected to further decomposition. The decomposition equation is:

where f(t) is the original signal, A is the approximation of low-frequency sub-signal (or further sub-signal), D is the details of high-frequency sub-signal (or further sub-signal) and n is the number of decomposition levels. More details of the Haar functions together
with their parametric notations could be seen in literature, e.g., Stankovi? and Falkowski
13].

Feature extraction

For each chromatogram wave, horizontal distance, height and half-wave width were sampled
from the primary (top) peak and, if applicable, the secondary (lower) peak, to represent
the uniqueness of a wave position. Distance, height ratio and half-wave width ratio
between the double peaks were then extracted as wave features efficient for subsequent
diagnoses. Figure 2 shows the three features extracted from double peaks at a wave position.

Fig. 2. Three features extracted from double peaks (waves) at a base position. L, M and R indicate the positions of the left bottom, the middle top and the right bottom of
a wave, respectively. a The horizontal distance (|x1
???x4
|). b The vertical height ratio (|y1
|/|y4
|). c The half-wave width ratio [(|y1
???y2
|?+?|y1
???y3
|)/(|y4
???y5
|?+?|y4
???y6
|)]

Intra-individual SNP diagnosis

Intra-individual SNPs were diagnosed with LM algorithm 14], 15] based BPNN 16], 17], which consisted of three layers: an input layer (the wave features), a hidden layer
and an output layer (the score; Fig. 1). Using simulated within-individual SNP data, the LM-BPNN was trained for the output
layer to meet a SNP score specification.

A total of 590 within-individual SNP samples synthetic of a wide range of the three
wave features were simulated, including 443 and 147 for training and validation, respectively.
An output value of each sample was generated using a fuzzy reasoning method 20], 21] and de-fuzzified to a score within the range 1–100.

LM-BPNN training was performed using the following weights and thresholds assumed
for the three input, ten hidden and one output vectors (Fig. 1),

a) The weight vector between the first input neuron and the ten neurons of the hidden
layer: [w(1,1)1
?
w(1,2)1
?…?w(1,10)1
],

b) The weight vector between the second input neuron and the ten neurons of the hidden
layer: [w(2,1)1
?
w(2,2)1
?…?w(2,10)1
],

c) The weight vector between the third input neuron and the ten neurons of the hidden
layer: [w(3,1)1
?
w(3,2)1
?…?w(3,10)1
],

d) The threshold vector of the hidden layer: [b(1)1
?
b(2)1
?…?b(10)1
],

e) The weight vector between the ten neurons of the hidden layer and the output layer:
[w(1,1)2
?
w(2,1)2
?…?w(10,1)2
], and

f) The threshold of the output layer: [b(1)2
].

In practice, the thresholds [b(1)1
?
b(2)1
?…?b(10)1
] and [b(1)2
] were treated as specific weights [w(0,1)1
?
w(0,2)1
?…?w(0,10)1
] and [w(0,1)2
], respectively. To determine the weight w(i,j)k
(i?=?0, …, 3; j?=?1, …, 10; k?=?1, 2), the LM algorithm is used

where w is the weight vector, ?w is the deviation of w, J(w) is the Jacobian matrix of vector w, ? is a coefficient, I is the identity matrix, and e(y) is the error. The learning steps are as follows.

1. Values between 0 and 1 were randomly assigned for initial weights and thresholds
assuming a maximum error ??=?0.1.

2. Compute the BPNN output, e.g., for the kth sample, where v is the input vector, w is the weight vector, and b(1)2
is the threshold of neuron yl
. The error is thus calculated as e k
?=?d k
???y k
, where d k
is the score calculated from the simulated data.

3. Compute the sum of square errors , where N is the number of samples.

4. If V(w)???, turn to step 7 below. Otherwise, compute the Jacobian matrix J(w).

5. Compute ?w using the LM equation as stated above.

6. Let w(t?+?1)?=?w(t)?+??w and compute the new sum of square errors V(w(t?+?1)) similarly as in step 3 above. If V(w(t?+?1))??V(w(t)), set V(w(t))?=?V(w(t?+?1)), w(t)?=?w(t?+?1) and ??=??/? (? is a correction factor) and turn back to step 4; Otherwise, suppose ??=???·?? and turn back to step 5.

7. Reach the optimal weights and end the training process.

Finally a true SNP was trained to a score ?75, a vague SNP to a score smaller than
75 but no less than 60, a false SNP to a score smaller than 60 but no less than 20,
and a strongly false SNP to a score 20. In addition, using another set of 147 simulated
samples, the performance of the trained LM-BPNN was tested.

A six-grade classification was established for the convenience of SNP diagnosis (Fig. 3), that is, grade 1 with a score???75 (a true SNP), grades 2–4 being 60???score??75
(a vague SNP), grade 5 being 20???score??60 (a false SNP) and grade 6 with a score??20
(a strongly false SNP). Grades 2–4 were further distinguished with the number of noisy
waves (height and width more than 70 and 50 % of the secondary peak, respectively),
that is, grades 2 (a true SNP), 3 (a possibly true SNP) and 4 (a possibly false SNP)
with no more than one, two and more than two noisy waves around, respectively. Grades
1 and 2 could be a high threshold for SNP diagosis while grades 3 and 4 could be a
relaxed threshold. Meanwhile, each SNP detected could be manually modified.

Fig. 3. The six SNP grades classified based on the score and noisy peaks around. The score
range for each grade was shadowed in black. Description of the number of noisy peaks
around was stated on the column for grades 2–4

Intra-individual InDel diagnosis

A stepwise allelic base alignment algorithm was employed for intra-individual InDel
detection. A maximal InDel size was set at 30 bases according to Bhangale et al. 9]. The presumed primary and secondary peak sequences were compared for a given region
dominated with continuous double peaks, supposing an interval of m (m?=?±1, ±2, … or ±30) bases to reach the maximal matchability and allowing for base
transposition between top and secondary peaks at a potentially misleading position.
The final m value indicates an InDel of the m bases, and its signal?+?or – represents the presence of an insertion or deletion
as compared to the alternative sequence.

Datasets

As there was no ‘standard’ dataset tested with earlier detection tools, we benchmarked
our software DiSNPindel with the three packages Mutation Surveyor, novoSNP and PolyPhred
using a set of 62 Eucalyptus EST amplicons (Additional file 2) for SNP detection, which had been directly sequenced for wet-lab validation of a
total of 66 SNPs associated with cleaved amplified polymorphic sequence (CAPS) markers
in one or both parents of an F
1
mapping population 22].

We also compared the performance of DiSNPindel with the four packages Mutation Surveyor,
novoSNP, PolyPhred and PrimeIndel for InDel detection using 77 directly sequenced
amplicons (Additional file 2) that contained intra-individual variation in simple sequence repeats (SSR) in either
or both of the parents of an F
1
mapping population 23].