Comparing K-mer based methods for improved classification of 16S sequences

Data

To compare methods we used two data sets. Trainingset9 is the data set used to compare 16S classification methods in [19], and was downloaded from RDP [11]. It consists of 10032 16S rRNA sequences varying from 320 to 2210 bases in length, with the majority around 1400 bases. There are 37 phyla and 1943 genera represented in this set.

The SilvaSet is an extract from the SILVA database [12], where the largest genera have been 'pruned' by random sampling to contain fewer sequences. This set has 29520 sequences, covering 29 phyla and 1533 genera. The main reason for including this data set is that it is a manually curated data set different from Trainingset9, which was used during the development of the RDP classifier.

In this paper we only consider classification to genus, i.e. the lowest taxonomic
level of these data. This is the most challenging and also the most relevant problem
for most studies where taxonomic classification is important.

The distributions of sequence abundance across genera are skewed for both Trainingset9 and SilvaSet. Genera with only one sequence available are by far the most common in Trainingset9 (Fig. 1). These singleton genera were included in the analysis, but will always be misclassified by all methods, and all reported errors exclude these sequences. In Trainingset9 few genera have more than 15 sequences, while some genera are considerably larger (not shown). The genus with the most sequences is Streptomyces, with 513 sequences. In the SilvaSet the differences in genus size are not as pronounced as in Trainingset9, but the majority of genera consist of 40 or fewer sequences. The genus with the most sequences is Pseudomonas, with 115 sequences.

Fig. 1. The size distribution histogram of genera in Trainingset9 and SilvaSet. The panels show the size distribution of genera in Trainingset9 and SilvaSet. Genera with more than 50 sequences are not included here; there are 26 such genera in Trainingset9 and 40 in SilvaSet.

To estimate model performance we conducted a 10-fold cross-validation [23] for all methods. The data were ordered alphabetically by genus name and assigned to ten segments by enumerating the sequences repeatedly from one to ten, i.e. every tenth sequence belongs to the same segment. This ensured a maximum spread of all genera across the segments. Each segment was set aside once as a test set, while the remaining segments were used as the training set in each cross-validation iteration.
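As an illustration, this segment assignment can be sketched in R as follows (a minimal sketch; the object name genus is hypothetical and holds one genus label per sequence):

  ord <- order(genus)                                    # order sequences alphabetically by genus
  segment <- integer(length(genus))
  segment[ord] <- rep(1:10, length.out = length(genus))  # enumerate 1,2,...,10,1,2,... along this ordering
  ## in iteration k, sequences with segment == k form the test set,
  ## the remaining nine segments form the training set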

K-mer based methods

All methods compared here represent a 16S sequence by its overlapping K-mers, i.e. words of length K. There are D = 4^K possible words of length K in the DNA (RNA) alphabet, and in our study we tested word lengths from two to eight. The methods tested differ in the way they represent a sequence as K-mers and how this information is utilized in a statistical learning algorithm to achieve the best possible classification.
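For illustration, counting the overlapping K-mers of a single sequence can be sketched in base R as below; the function name count_kmers is hypothetical and only standard base functions are used:

  count_kmers <- function(sequence, K) {
    alphabet <- c("A", "C", "G", "T")
    ## all D = 4^K possible words, sorted alphabetically
    words <- sort(apply(expand.grid(rep(list(alphabet), K)), 1, paste, collapse = ""))
    ## all overlapping K-mers in the sequence
    starts <- 1:(nchar(sequence) - K + 1)
    kmers <- substring(sequence, starts, starts + K - 1)
    ## count how many times each possible word occurs
    as.integer(table(factor(kmers, levels = words)))
  }

The resulting vector of length D is the starting point for all methods below: the RDP method reduces it to presence/absence, while the other methods use the counts directly.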

All five methods were implemented in the software environment R [24]. Our implementation of the RDP classifier was tested against the original Java implementation to ensure consistency. Implementations of PLS and nearest-neighbour classification already exist in the R environment.

RDP

The RDP method considers only the presence/absence of a word in a sequence, not its
frequency. All words of length K are ordered alphabetically as w_1, w_2, …, w_D. For every sequence, we create a vector of D elements where element j is 1 if word w_j is present in the sequence, and 0 if not. We have chosen to describe the RDP method in detail below, even if this has been done in [19], because this method serves as a reference for the other methods described later.

Training

For each of the N sequences in the training set we get a vector of 1's and 0's, and these vectors are arranged as rows in the N×D matrix A_rdp.

First, we estimate the unconditional probabilities: the probability of presence of each word regardless of genus. Summing the elements in each column of A_rdp produces the vector n_1, n_2, …, n_D, i.e. n_j is the number of sequences in the training set where word w_j is observed at least once. The probability that word w_j will be found present in any sequence is estimated by

p_j = (n_j + 0.5) / (N + 1)        (1)

where the added 0.5 and 1 guarantee that no probability is zero or one.

Next, consider only sequences from genus g, i.e. we consider a sub-matrix of A_rdp containing only the M_g rows corresponding to genus g. Again we can sum the elements in each column of this sub-matrix, and we get the vector m_{g,1}, m_{g,2}, …, m_{g,D}, i.e. m_{g,j} is the number of sequences from genus g where we observe the word w_j at least once. The genus-specific or conditional probabilities are estimated by

q_{g,j} = (m_{g,j} + p_j) / (M_g + 1)        (2)

where p_j from (1) serves as a word-specific pseudo-count. If the training set contains data for G genera, we can arrange the probabilities q_{g,j} in a G×D matrix Q_rdp, where the element in row g and column j is q_{g,j}, for g = 1, …, G and j = 1, …, D. This matrix Q_rdp is the trained model, with a set of probabilities (a row) for each genus.
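For concreteness, this training step can be sketched in R as below, assuming the presence/absence matrix A_rdp and a vector genus of genus labels have already been constructed (object names are hypothetical):

  N <- nrow(A_rdp)
  n <- colSums(A_rdp)                           # n_j: number of sequences containing word w_j
  p <- (n + 0.5) / (N + 1)                      # unconditional probabilities, eq. (1)
  genera <- unique(genus)
  Q_rdp <- t(sapply(genera, function(g) {
    sub <- A_rdp[genus == g, , drop = FALSE]    # sub-matrix for genus g (M_g rows)
    (colSums(sub) + p) / (nrow(sub) + 1)        # conditional probabilities, eq. (2)
  }))
  rownames(Q_rdp) <- genera                     # the trained G x D model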

Classification

Given a new sequence we construct the vector a corresponding to a row in the matrix A_rdp from above. Element j in a is 1 if word w_j is found in the new sequence, and 0 otherwise. The unconditional probability of a is found from (1) by

Pr(a) = ∏_{j=1}^{D} p_j^{a_j}        (3)

where a_j is element j in a and p_j is from (1). Notice that Pr(a) is a joint probability of observing the words we see in this sequence. The naïve Bayes approach lies in the assumption that this joint probability can be written as a product of the marginal probabilities, as we have done on the right hand side above. This assumption is correct only if the elements of a are independent, which is a naïve assumption, but often still works in a satisfactory manner.

The conditional probability of a given some genus g is computed in a similar way from (2) by

Pr(a|g) = ∏_{j=1}^{D} q_{g,j}^{a_j}        (4)

From the general relation between conditional and marginal probabilities it follows that

Pr(g|a) = Pr(a|g)·Pr(g) / Pr(a)        (5)

where the probability on the left hand side is the criterion we use to classify. This is the posterior probability of genus g given the observed sequence a, and we classify to the genus that maximizes this probability. On the right hand side we have the prior probability of genus g, Pr(g), in addition to the two probabilities we computed in (3) and (4). It is customary to set the prior probability equal to the proportion of data from genus g in the training data set. In the RDP classifier the prior probabilities are assumed to be equal for all genera, and genera with few sequences are just as likely to be observed as those with many sequences in the training set. In our study we considered both flat priors (RDP) as well as priors proportional to genus abundances.

The posterior probability Pr(g|a) is computed for every genus, and we assign the sequence to the genus with the largest probability. Notice that the denominator Pr(a) in (5) does not depend on genus g. Hence, the g that maximizes Pr(g|a) is exactly the same g that maximizes Pr(a|g)·Pr(g), and we can ignore Pr(a) altogether. Also, if the prior probabilities Pr(g) are identical for all genera, we get the simple relation Pr(g|a) ∝ Pr(a|g).

From a computational perspective, we prefer the log-transformed version of (5) (ignoring Pr(a)), and using the relation in (4) we get

log_2(Pr(a|g)·Pr(g)) = ∑_{j=1}^{D} a_j·log_2(q_{g,j}) + log_2(Pr(g))        (6)

since this log-probability is maximized for the same g as the posterior in (5). If the matrix Q_rdp from the training step is log-transformed and called L_rdp, and p is the column-vector of the G log-priors for all genera, we can compute the score vector

z = L_rdp·a + p        (7)

where L_rdp·a is the matrix product of L_rdp and the column vector a. The score vector z has one element for each genus, and we assign to the genus where z has its maximum value. In case of two or more genera obtaining the same maximum value, the sequence is marked as unclassified.

Notice that with flat priors, the terms log_2(Pr(g)) are identical for all g, i.e. all elements of p are identical, and p can be omitted from (7) since it adds the same constant to all genera.
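A corresponding sketch of the classification step in R, assuming Q_rdp from the training above, a vector p_log of log2-priors and the presence/absence vector a of the new sequence (object names are hypothetical):

  L_rdp <- log2(Q_rdp)                          # G x D matrix of log-probabilities
  z <- as.vector(L_rdp %*% a) + p_log           # score vector, eq. (7)
  best <- which(z == max(z))
  genus_hat <- if (length(best) == 1) rownames(Q_rdp)[best] else NA   # tie -> unclassified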

Multinomial

The Multinomial method differs from the RDP method by considering the relative frequency of every word instead of presence/absence. The naïve Bayes principle is the same. A similar approach has also been tested by Liu and Wong in [22].

Training

For each of the N sequences in the training set we get a vector of frequencies, i.e. element j is the number of times we observe w_j in the sequence. These vectors are arranged as rows in the N×D matrix A_frq.

As before we consider a sub-matrix containing the M_g rows corresponding to genus g. Summing the elements in each column of this sub-matrix we get the vector m_{g,1}, m_{g,2}, …, m_{g,D}. The genus-specific frequencies F(w_j|g) are:

F(w_j|g) = m_{g,j} + c        (8)

where a pseudo-count c > 0 is added to each frequency to avoid 0 counts. The multinomial probabilities for genus g are then calculated by dividing each F(w_j|g) by the row sum, giving a new set of multinomial probabilities q_{g,j}:

q_{g,j} = F(w_j|g) / ∑_{i=1}^{D} F(w_i|g)        (9)

The trained model consists of the (G×D) matrix Q_mlt where row g contains the multinomial probabilities q_{g,j} for genus g.
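A sketch of this training step in R, assuming the count matrix A_frq and the label vector genus as before; the value of the pseudo-count pc is an assumption, since only its purpose is stated above:

  pc <- 1                                                  # assumed pseudo-count (c in eq. (8))
  genera <- unique(genus)
  Q_mlt <- t(sapply(genera, function(g) {
    f <- colSums(A_frq[genus == g, , drop = FALSE]) + pc   # F(w_j|g), eq. (8)
    f / sum(f)                                             # multinomial probabilities, eq. (9)
  }))
  rownames(Q_mlt) <- genera                                # the trained G x D model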

Classification

From the new sequence we construct the frequency vector a corresponding to a row in the matrix A_frq above. Again we use the naïve Bayes approach to compute a score vector z:

z = L_mlt·a + p        (10)

where L_mlt is the log-transformation of Q_mlt from the training step and p are the log-priors, just as for the RDP classifier. The score vector z has one element for each genus, and the sequence is assigned to the genus with maximum score in z. In case of two or more genera obtaining the same maximum value, the sequence is marked as unclassified.

Markov

In the present context ordinary Markov models consider word frequencies, but differ from the naïve Bayes principle used by the previous two methods. Markov models have been tested on sequence data with the K-mer approach in earlier studies, e.g. by Davidsen et al. [25].

Training

The training step corresponds to estimating the transition probabilities of the Markov model. Any word of length K can be split into the pretext, consisting of the first K-1 symbols, and the last letter, being A, C, G or T. The transition probabilities are the conditional probabilities of the last letter given the pretext. These probabilities are usually organized in a transition matrix with 4 columns (one for each letter) and one row for each pretext (4^{K-1} rows). However, these probabilities can equally well be organised in a single row-vector, where the conditional probabilities of A given the ordered pretexts are found at positions I_A = (1, 5, 9, …), those of C given the ordered pretexts at positions I_C = (2, 6, 10, …), and so on. Note that this corresponds to the K-mers in alphabetical order. Each consecutive block of four positions corresponds to the same pretext, extended by A, C, G and T, respectively.

The matrix A_frq and its genus-specific sub-matrices are computed as for the Multinomial method. Summing the elements in each column of the sub-matrix for genus g again produces the genus-specific frequencies F(w_j|g) as in (8). If K-mer w_j contains pretext h followed by, say, A, then the corresponding genus-specific transition probability is estimated by

q_{g,j} = F(w_j|g) / (F(w_j|g) + F(w_{j+1}|g) + F(w_{j+2}|g) + F(w_{j+3}|g)),   j ∈ I_A        (11)

i.e. the frequency of w_j divided by the total frequency of the four K-mers sharing pretext h. If the pretext is instead followed by C, G or T, I_A is replaced by the corresponding index set, and the denominator still sums over the four K-mers with pretext h. If we had organized the transition probabilities in a matrix, this value would appear in cell (h,1) since we consider pretext h followed by A (column 1). Instead we arrange these probabilities in a row vector of D elements. Having the transition probabilities for each genus, we arrange the vectors as rows in a (G×D) matrix Q_mrk. The latter organization of the transition probabilities is done only to have the same data structure as for the other methods; it does not affect the computations.
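A minimal sketch of this normalisation in R, assuming F_g holds the genus-specific frequencies F(w_j|g) for one genus g, with the K-mers in alphabetical order (object names are hypothetical):

  ## each consecutive block of four K-mers shares the same pretext
  pretext_sum <- colSums(matrix(F_g, nrow = 4))   # total frequency per pretext
  q_g <- F_g / rep(pretext_sum, each = 4)         # transition probabilities, eq. (11)
  ## stacking the q_g vectors for all genera as rows gives the (G x D) matrix Q_mrk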

Classification

From the new sequence we count K-mers as for the Multinomial method, constructing the frequency vector a corresponding to a row in the matrix A_frq. We compute scores for the sequence as

z = L_mrk·a        (12)

where L_mrk is the log-transformation of Q_mrk. Again we classify to the genus yielding the maximum score. In case of a tie, the sequence is marked as unclassified.

Nearest-neighbour (NN)

In this method we use nearest-neighbour classification based on multinomial probabilities.
Nearest-neighbour methods have no specific training step, but use the training data
as a database and perform a lookup based on some characteristics of the query sequence.
Another 16S nearest-neighbour method, called the Similarity Rank tool, was published by Maidak et al. [26] for use in The Ribosomal Database Project.

As before we compute the (N×D) matrix A_frq by word counting, where N is the number of sequences in the training set. Then we divide all elements in a row by its row-sum to obtain multinomial probabilities, and these are stored in the (N×D) matrix A_mlt. Thus, each training sequence, with its labelled genus, is represented as a row in this matrix.

For every new sequence we also count word frequencies and divide by the number of words in the sequence, producing a vector a similar to a row in A_mlt. The Euclidean distance from a to all sequences (rows) in the training set is computed, and the new sequence is assigned to the same genus as the nearest neighbour in the training set. In case of a tie, i.e. two or more genera are nearest neighbours, it is left unclassified.
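The lookup can be sketched in R as follows, assuming A_mlt, the label vector genus and the profile a of the new sequence (object names are hypothetical):

  d <- sqrt(rowSums(sweep(A_mlt, 2, a)^2))         # Euclidean distance from a to every training sequence
  nearest <- which(d == min(d))
  hit <- unique(genus[nearest])
  genus_hat <- if (length(hit) == 1) hit else NA   # tie between genera -> unclassified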

Preprocessed nearest-neighbour (PLSNN)

In this method we extend the nearest-neighbour approach by combining it with the partial least squares (PLS) method [27]. This is a supervised learning method that has been used in many bioinformatics applications (e.g. [28–32]). A reason for the wide-spread use of PLS is that it is especially applicable when we have many correlated explanatory variables, which is typical for the present K-mer data, especially as K increases.

The idea is to compute a linear mapping from the K-mer frequency space to a much lower-dimensional space, and then look for nearest neighbours in this low-dimensional space. In K-mer space every sequence has D = 4^K coordinates, and in the nearest-neighbour method above all coordinates (K-mers) have equal weight. However, some of these are likely to be more important than others for recognizing a particular genus. Replacing the original D-dimensional space by a smaller number of combinations of the coordinates can be seen as a preprocessing of the data before the nearest-neighbour step, hopefully resulting in more 'correct' distances between sequences when seeking the nearest neighbour.

Training

From the training data we again compute the (N×D) matrix A_mlt as above. This is used as the matrix of explanatory variables in training the PLS method. The response is the genus for each sequence, coded as a row-vector of G elements with 1 in position g if the sequence comes from genus g and 0 in all other positions. These row-vectors are assembled into an (N×G) matrix Y.

The PLS method is based on the linear model

Y = A_mlt·β + E        (13)

where β is a (D×G) matrix of regression coefficients and E is the error term. The algorithm searches for an orthogonal sub-space by combining the variables (columns) of A_mlt and maximising the covariance between Y and A_mlt. The algorithm first finds the 1-dimensional sub-space, then the 2-dimensional, etc. The main idea is to stop the search after C dimensions, where C ≪ D but still large enough to give a good fit according to the model in (13). This means we end up with

A_mlt ≈ S·R^T        (14)

where the (N×C) dimensional matrix S consists of linear combinations of the columns in A_mlt, and R is a (D×C) orthonormal projection matrix. The rows of S are the training sequences represented in the C-dimensional subspace with maximum covariance to the genus information. In this representation we have filtered out less important variation in K-mer frequencies, e.g. variation within genera. Distances between sequences in this space should be more sensitive to between-genus variation and less sensitive to within-genus variation. For every word length K we tested 8 different dimensions C. The maximum was set to C_max = min(N-1, D-1, 2000), and we used C = i·C_max/8 for i = 1, 2, …, 8.
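One way to carry out this step in R is with the plsr() function from the pls package, as sketched below; this choice of tooling is an assumption, not necessarily the authors' implementation. A_mlt, Y and genus are as described above:

  library(pls)
  C_max <- min(nrow(A_mlt) - 1, ncol(A_mlt) - 1, 2000)
  C <- round(C_max / 8)                  # e.g. the smallest of the 8 tested dimensions
  fit <- plsr(Y ~ A_mlt, ncomp = C)      # fit the linear model (13) with C components
  S <- unclass(scores(fit))              # N x C scores: the training sequences in the subspace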

Classification

For every new sequence we compute a vector a similar to a row in A_mlt. From (14) it follows that A_mlt·R ≈ S since R is orthonormal, and thus we can compute s = a·R. The vector s is the representation of the new sequence in the subspace spanned by S. The new sequence is finally classified with the nearest-neighbour method as before, where the Euclidean distances from s to all rows of S are considered.
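A matching sketch of the classification step, assuming fit, S and genus from the training sketch above and the multinomial profile a of the new sequence; note that plsr() centres the predictors by default, so the projection uses the stored column means:

  s <- as.vector((a - fit$Xmeans) %*% fit$projection)   # s = a R in the centred coordinates
  d <- sqrt(rowSums(sweep(S, 2, s)^2))                  # Euclidean distances from s to all rows of S
  genus_hat <- genus[which.min(d)]                      # genus of the nearest neighbour (ties ignored here)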