Probabilistic topic modeling for the analysis and classification of genomic sequences

In this Section we present the 16S bacteria dataset used and we describe both the
experiments settings and the results obtained using the probabilistic topic modeling
approach for sequence classification. Our results are compared with other two algorithms
used for sequence classification: the RDP classifier and the support vector machine
classifier.

Datasets used

We evaluated our approach for gene sequences classification considering bacteria species.
For classification and taxonomic studies of bacteria, it is usually considered only
a limited part of the genome, about 1200-1400 bp, that is the housekeeping 16S rRNA
gene 3]. In our study we arranged a 16S dataset downloading the gene sequences from the Ribosomal
Database Project (RDP) repository 41], release 10.32. We chose the four richest phyla, Actinobacteria, Bacteroidetes, Firmicutes,
Proteobacteria, and, in order to retain a good quality dataset, we selected the 16S
sequences that satisfy the following constraints:

1 type strain, representing reference specimen;

2 size ? 1200 bp, considering this way full gene sequences;

3 good quality, according to the quality parameters provided by the RDP repository;

4 NCBI taxonomy, i.e. sequences are labeled with the NCBI taxonomic nomenclature 42].

Moreover we left out unclassified sequences and taxonomic ranks with lesser than ten
sequences, in order to obtain a well balanced dataset. Using these criteria, we set
up a 16S dataset consisting of 7856 sequences, whose main features are summarized
in Table 1.

Table 1. Main features of the 16S bacteria Dataset.

Experimental setup

The experiments proposed in this paper, aimed at validating the probabilistic topic
modeling approach, represent an expansion and an in-depth analysis of our previous
work 43]. There, with a smaller dataset of 3000 sequences, we carried out a series of trials,
using a tenfold cross-validation procedure, in order to test how the classification
results varied with regards to the number of topics and the dataset composition. We
obtained, with k -mer size = 8, global results ranging from 99% of precision score at phylum taxonomic
level to 80% at family level. In all cases, we noticed that the best scores were reached
only when the number of apriori fixed topics is at least equal to the number of different
categories of the input dataset. For example, if we want to classify our dataset at
order level, we have to train a topic model with a number of topics equal or greater
than the number of orders. Of course only in an ideal situation the number of topics
matches exactly with the number of categories, in fact in our previous study we obtained
better results with a larger number of topics, about two times the number of categories,
considering a situation in which each different class covers, in average, two most
probable topics. In this work, we enriched that experimental pipeline first of all
taking into account a bigger dataset consisting of 7856 gene sequences, described
in “Dataset used” section. Moreover, in order to tune the choice of the number of
topics, the probabilistic topic models were trained in a hierarchical way. That means
we fitted a different topic model at each taxonomic level, for the four different
phylum. Considering the Firmicutes phylum, for instance, in order to classify at class
level, we trained a model considering an input training set composed of all the Firmicutes
sequences. In order to classify at order level, we trained a different topic model
for each of the four different classes of Firmicutes phylum (look at Table 1 for info
about the number of categories of our bacteria dataset), and so on. As a general rule,
we considered, for each topic model a number of topics equal to one time and two times
the number of lower categories: if one class has four orders, for that class we trained
a topic model with four and eight topics. Once again all the tests have been carried
out by means of a ten fold cross-validation procedure.

Unlike our previous work, in this paper we also evaluate the robustness and the generalization
ability of our approach with respect to the sequences length. For this reason, we
tested our method also with small sized sequences, considering respectively sequence
fragments of 400, 200, 100, 50, 40, 25 bp. In this case we submit to the testing workflow
a fragment of length f (with f = 25, 40, 50 and so on) randomly extracted from the full length sequence and we consider
the output classification. The need of a robust classifier able to correctly predict
the taxonomic rank of small DNA fragments is of fundamental importance in metagenomics
applications, where genetic sequences are mainly extracted from environmental species
and in many cases ultra short sequences, with size ? 50 bp, are available 44].

Classification results, in terms of precision scores (Eq. 7), were compared with other
two sequence classifiers: the RDP classifier 20], and the SVM classifier. The former consists of a naive Bayesian classifier trained
on a k -mer representation of the sequences, the latter works on a vector representation
of the gene sequences obtained considering the number of k -mers occurrences. We adopted the SVM implementation provided by the R package e1071 45], that allows a simple interface with the well known LIBSVM library 46]. SVM has been run with default parameters and Gaussian Radial Basis kernel.

Experimental results

The precision scores obtained using our probabilistic topic modeling approach, the
RDP classifier and the SVM classifier, for the 16S rRNA dataset described in Section
Training dataset, are organized in the charts of Figures 4 to 7. The precision scores are average results obtained by means of a ten fold cross-validation
procedure. Each chart shows the score trends as a function of the fragment size (full
length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp) at a different taxonomic rank,
from phylum to family. Unfortunately, RDP classifier works only with sequences of
at least 50 bp: in fact with fragments of size 40 bp and 25 bp it is unable to provide
a classification results. This way precision scores for 40 bp and 25 bp fragments
have been linearly extrapolated for the RDP curve. In all the charts, extrapolated
values are represented with the dashed line for the RDP curve. From all the charts,
it is immediately clear that the SVM classifier provides acceptable precision results,
ranging from 99% at phylum level down to 97% at family level, only when applied to
full length sequences. In all other situations, the SVM algorithm drops significantly
its performances. In fact, with sequence sizes from 400 bp to 25 bp, the SVM looses
completely its predictive power, resulting useless when applied to sequence fragments.
This behaviour reflects the fact that the vector representation of sequence fragments
is quite different from the vector representation of the full sequences composing
the training set. SVM, therefore, is not able to generalize the prediction of small
sequences. Our approach, briefly called LDA approach from here on, and the RDP classifier
show, on the other hand, more robust an significant results. LDA and RDP, in fact,
always produce very similar results, at each taxonomic level and for each sequence
size, from full length to 50 bp. The LDA’s precision scores are slightly lower than
the results obtained through the RDP classifier, with an average spread within 10%,
and maximum scores greater than 70% in each case. Our LDA approach shows its effectiveness
when applied to ultra short fragments, i.e. 50 bp, 40 bp and 25 bp. Considering 50
bp fragments, the LDA and the RDP scores are very close, within 5% of difference,
but while the RDP classifier works only with fragment size of at least 50 bp, our
LDA approach gives very reliable results, about 70%, even with fragment size of 40
and 25 bp. That means, for example, that with only 25 nucleotides, we are able to
predict the family of an unknown sequence with a 70% confidence. Moreover at class,
order and family level, our LDA approach not only gives an affordable classification
results, but if we compare these scores with the ones extrapolated for the RDP classifier,
we obtain higher scores. This behaviour is evident above all in the family case, Figure
7, where the LDA method surpasses the RDP score with 50 bp fragments, with an increase
of 11%, and, if we consider the estimated scores of RDP at 25 bp, the performance
increment is about 140%. Furthermore in this chart we can observe how the performance
decrease of the LDA approach is very smooth, while the RDP classifier shows a rapid
decrease, with a performance drop with respect to ultra short sequences (50, 40 and
25 bp).

Figure 4. Precision scores at phylum level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp,
50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database
Project (RDP) and Support Vector Machine (SVM) classifiers at phylum taxonomic rank.
The dashed line for the RDP curve represents extrapolated values.

Figure 5. Precision scores at class level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp,
50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database
Project (RDP) and Support Vector Machine (SVM) classifiers at class taxonomic rank.
The dashed line for the RDP curve represents extrapolated values.

Figure 6. Precision scores at order level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp,
50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database
Project (RDP) and Support Vector Machine (SVM) classifiers at order taxonomic rank.
The dashed line for the RDP curve represents extrapolated values.

Figure 7. Precision scores at family level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp,
50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database
Project (RDP) and Support Vector Machine (SVM) classifiers at family taxonomic rank.
The dashed line for the RDP curve represents extrapolated values.