leBIBI QBPP: a set of databases and a webtool for automatic phylogenetic analysis of prokaryotic sequences

Using leBIBI QBPP is straightforward: it only requires pasting a sequence (or a set of sequences)
into a web page. A test sequence, randomly chosen from a predefined set, can also be used
for demonstration purposes or to verify that the system is operational.

Data analysis

LeBIBI QBPP results appear in a page containing a summary of all computations. All files that
were used or generated by leBIBI QBPP are also accessible.

Report

The report (Fig. 1) summarizes the analysis and gives information relevant for the interpretation of
the phylogeny and the taxonomic assignment of the query sequence. A summary of the
database used is given along with statements about the consequences of the database
stringency on the obtained phylogeny.

Fig. 1. The leBIBI QBPP report summarizes the analysis and gives additional information that may be useful
for interpreting the phylogenetic tree

The nucleotide composition of the query sequence is given. If it contains any undetermined
bases, their number is an indicator of the quality of the sequencing process. Too
many undetermined bases have a negative impact on the quality of the phylogeny, and
this triggers a warning message. The length of the matching section of the first BLAST
hit is expected to be close to the length of the query (≥ 95 %). A shorter match does
not necessarily impair phylogenetic reconstruction, but it may indicate a global or local low quality
of the query sequence. For those who continue to use the 97 % identity rule to identify
bacterial or archaeal 16S sequences, this indication is a reminder that the
rule is not without shortcomings.
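
The two quality checks described above can be sketched as follows. This is a minimal illustration, not the leBIBI QBPP implementation: the function name `query_quality` and the 1 % limit on undetermined bases are hypothetical, and the coverage threshold assumes that the match length of the first BLAST hit should reach about 95 % of the query length.

```python
from collections import Counter


def query_quality(sequence, first_hit_match_length,
                  max_undetermined_fraction=0.01,  # hypothetical limit, not from the article
                  min_coverage=0.95):              # assumed reading of the ~95 % expectation
    """Summarize the composition of a query and flag possible quality problems."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty query sequence")
    counts = Counter(seq)
    length = len(seq)
    # Fraction of each determined base (A, C, G, T).
    composition = {base: counts.get(base, 0) / length for base in "ACGT"}
    # Any other symbol (N, IUPAC ambiguity codes) is counted as undetermined.
    undetermined = length - sum(counts.get(base, 0) for base in "ACGT")
    coverage = first_hit_match_length / length

    warnings = []
    if undetermined / length > max_undetermined_fraction:
        warnings.append("too many undetermined bases")
    if coverage < min_coverage:
        warnings.append("first BLAST hit covers too little of the query")
    return {
        "composition": composition,
        "undetermined_fraction": undetermined / length,
        "coverage": coverage,
        "warnings": warnings,
    }
```

For example, `query_quality("ACGTNNACGT", 9)` reports an undetermined fraction of 0.2 and a coverage of 0.9, and therefore raises both warnings.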

Phylogenetic reconstruction by leBIBI QBPP
can be expected to be reliable when the output tree contains sequence clusters from
various species and genera around the query sequence. Ideally, an outgroup belonging
to another, closely related, genus is required to interpret the phylogeny. Such
an outgroup genus is, however, not absolutely needed if the genus under consideration
contains multiple, phylogenetically distant species.

The goal of identifying the proximal cluster is to indicate whether the query
is inside, or close to, a taxonomically homogeneous cluster. Patristic distances between
different sequences belonging to the same species or genus help determine whether
the query sequence belongs to a given taxon. The closest TS to the query sequence
is also shown, as well as its presence in the closest cluster, in order to potentially
link the query sequence to an existing classification. Even if this is something of an
approximation, a strain is not expected to be phylogenetically far from the strains
of the same species in terms of patristic distance; the same is expected for a species
within the genus. Therefore, the position of the query sequence in the distribution
of intra-species and intra-genus patristic distances is given. The 75th percentile
of these distributions was chosen because of the possible presence of outliers, essentially
ill-identified sequences.
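
As an illustration of this comparison, the sketch below places a query distance relative to the 75th percentile of a set of intra-taxon patristic distances. The function name, the input format, and the simple "at or below the percentile" decision are assumptions made for the example; the article does not specify how the report computes or presents this position.

```python
import statistics


def within_taxon_range(query_distance, intra_taxon_distances):
    """Compare a query distance with the 75th percentile of intra-taxon distances.

    Returns (is_within, q75). Needs at least two reference distances.
    """
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the 75th percentile. "inclusive" treats the data
    # as the whole population rather than a sample.
    q75 = statistics.quantiles(intra_taxon_distances, n=4, method="inclusive")[2]
    return query_distance <= q75, q75
```

Using an upper percentile rather than the maximum keeps a few misidentified sequences, which would show up as very large intra-taxon distances, from stretching the accepted range.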

A warning may be issued when the proximal cluster and the closest sequence are compared.
It indicates a possible phylogenetic reconstruction problem, in which case
a careful reading of the tree, paying attention to SH support values, may be useful.

Phylogenetic trees and alignment

The phylogenetic tree is labelled with sequence names that reflect the compliance
of species names with nomenclature and whether they originate from type strains. The
expression of the SH support level through branch width enables a direct interpretation
of branch robustness. A similar tree with SH support as numerical values is also accessible.

The “Taxo-Tree” has been designed to rapidly identify whether outliers have been erroneously
recruited by BLAST, but it may also be useful where nomenclature does
not rigorously follow phylogeny, by pointing out inconsistencies.

The sequence alignment is provided as an SVG file to enable a survey of alignment quality.

Mitigation of BLAST-induced unexpected phylogenetic trees

The BLAST algorithm searches a database for similar sequences, not for the phylogenetically
closest ones. Consequently, a very loosely related sequence is sometimes recruited.
Such outlier sequences may lead to difficulties in tree interpretation and
are frequently characterized by the appearance of a long branch. A “reverse QBPP” procedure
may identify the problem: submitting the outlier sequence, easily accessed through
the tree hypertext links, to leBIBI QBPP
will lead to a completely different set of selected species and a completely different tree.
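
A simple way to spot candidates for the “reverse QBPP” check is to look for unusually long terminal branches. The sketch below is only an illustration under stated assumptions: the function name, the dictionary input (leaf name mapped to terminal branch length), and the rule "longer than five times the median" are all hypothetical and are not part of leBIBI QBPP.

```python
import statistics


def flag_long_branches(terminal_branch_lengths, factor=5.0):
    """Return leaf names whose terminal branch is much longer than the median.

    terminal_branch_lengths: dict mapping leaf name -> branch length.
    factor: hypothetical multiplier of the median used as a cutoff.
    """
    if not terminal_branch_lengths:
        return []
    median = statistics.median(terminal_branch_lengths.values())
    cutoff = factor * median
    return sorted(name for name, length in terminal_branch_lengths.items()
                  if length > cutoff)
```

Sequences flagged this way are only suspects; resubmitting each of them as a query shows whether it recruits a different set of species.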

Usability

Strategy of exploration and databases

The leBIBI QBPP
databases and web tools are designed to quickly reconstruct a phylogeny from an SSU
rDNA (16S) or a housekeeping protein gene sequence. They also provide elements that help
the biologist interpret the tree and, especially, place the query sequence within
a known taxonomic rank. As underlined in the databases section above, several different
databases are available, and QBPP gives a more informative answer if an adapted querying
strategy is followed.

The optimal strategy is a trade-off between the advantages of a large number of recruited
sequences (it increases the likelihood of having recruited all closely related sequences,
and the quality of the coverage of diversity and of the phylogeny), processing speed,
and ease in interpretation. The best protocol is to begin with a rather stringent
database to maximize phylogenetic and taxonomic diversity. Retaining at least 50 sequences
around the query reduces the risk of missing the phylogenetically closest strains
because they are far down the BLAST hit list. If a broad variety of taxa is obtained
(i.e., with external groups, especially genera), the number of extracted sequences
can be reduced for better readability, but the user must then verify that
there is no change in the closest clusters. Conversely, if the tree does not
contain enough species diversity, it is necessary to increase the number of extracted
sequences or to try a more stringent database. It is always very important to test
the “lax” database because some sequences of important uncultured species are present
in this collection only.
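
The querying strategy described above can be summarized as a small decision helper. This is a sketch only: the function name, the returned labels, and the numeric thresholds for "enough" species and genera are hypothetical, since the article gives the reasoning but no cutoffs.

```python
def suggest_next_step(n_species, n_genera,
                      min_species=3,   # hypothetical threshold for species diversity
                      min_genera=2):   # hypothetical: at least one external genus
    """Suggest how to adjust a run from the diversity seen in the tree."""
    if n_species < min_species:
        # Not enough species diversity around the query.
        return "increase_or_more_stringent"
    if n_genera >= min_genera:
        # Broad variety of taxa: fewer sequences may improve readability,
        # provided the closest clusters stay the same.
        return "may_reduce_then_verify_clusters"
    return "keep"
```

Whatever the outcome, the text above still recommends a final run against the “lax” database, because some uncultured species are present only there.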

The “lax” SSU rRNA database is the broadest, so its processing is the slowest. This
database contains a lot of sequences that are approximately or wrongly identified
or of poor quality and short (albeit 300 bp in length). It should be used to build
a phylogeny of the query sequence versus any prokaryote (cultivated, environmental),
but often the taxonomy will be difficult to interpret. It is exhaustive, but the phylogenetic
signal may be blurred by a swarm of approximately identified sequences, of low quality
and possibly redundant. The “lax” SSU rRNA database is therefore essentially exploratory,
more suited for research than for routine analyses.

The “stringent” SSU rRNA database is the best when it comes to the quality of the
phylogenetic reconstruction because it contains fewer sequences and its entries are globally better
characterized. It contains sequences of strains that are validly named
and represent all the biological diversity of species (besides another diversity created
by errors in naming or publishing of sequences). Some strains belonging to a species
whose members can be phylogenetically very distant (such lack of homogeneity is mostly
due to lack of strains or taxonomical studies) cannot be processed without using this
“stringent” database. On the contrary it may be impossible to use it in the case of
species that are highly represented in GenBank because the phylogeny cannot be computed
due to a large number of nearly identical sequences. In such a case the tree is saturated
by one species (or genetic variant) and cannot be interpreted. This database is also
the only one that allows computing the distribution of the distances between sequences
within one species, because it often contains more than one sequence for a given species.
Unfortunately it also contains incorrectly identified sequences, introducing uncertainty
or errors in the interpretation of the phylogeny.

The “TS-stringent” database (only available for SSU rDNA sequences) is less contaminated
by erroneous species names and generally the identification of the sequences is of
high quality. This is obtained by a decrease of the variety (mostly one strain per
species, the TS being present) that may lead to poor phylogenetic reconstructions
in the case of high genomic variations among the species and a TS that is not representative
of this diversity. Technical uncertainties in sequencing, possible contaminations,
or tube-switching explain the observed inconsistencies in the position of multiple sequences
of the same type strain in the phylogenies. Unknown species are also more difficult
to position among the already validated species clusters. Uncultured species are mostly
missing as their TS are not defined.

The “superstringent” database (also only available for SSU rDNA sequences) is the
smallest and fastest to run. The sequences are selected to be representative of a
given species and are of high quality. Neither the diversity within a species, nor
technically induced biodiversity is represented. As in the similar BLAST database
entitled “rRNA typestrains/prokaryotic 16S ribosomal RNA” developed at the NCBI, uncultivated
prokaryotes are not present because of the absence of TSs for these species. This
database gives generally accurate phylogenies that may nevertheless sometimes be incorrect or not
representative of the biological reality. The uncultured species are missing as in
the “TS-stringent” database.

The “genus-level” database (also only available for SSU rDNA sequences) is the oddest
of all. Its sequences are selected to represent all recognized genera. This is mostly
useful to build very large phylogenies around a well identified query. Interpretation
of the resulting tree may be difficult without studies with less stringent databases.

Comparison with other functionally close solutions

LeBIBI QBPP
is somewhat similar to other webtools that combine the selection of sequences similar
to the query (by BLAST or other approaches) with a pipeline that performs a multiple
alignment and finally computes a phylogeny. The closest equivalent is the NCBI BLASTN
using the “16S ribosomal RNA sequence (Bacteria and Archaea)” database. This database
is similar to our “superstringent” database and it is possible to compute a phylogenetic
tree using a distance matrix built with BLAST pairwise alignments and either the Fast
Minimum Evolution [92] or Neighbor Joining (NJ) [93] algorithms. The alignment between
the query sequence and the sequences retrieved by the
BLAST search is not a multiple alignment and may only partially cover the query sequence.
This is not a true phylogenetic reconstruction, unlike that performed by leBIBI QBPP,
where a global alignment is computed and the tree is built with the ML approach, which
outperforms distance methods. The same search on the NCBI site may use the whole
GenBank database with the option of excluding “environmental samples”. This database
is then close to our “lax” database, but this does not compensate for the absence of true phylogenetic
reconstruction, and in many situations the tree is overcrowded by very similar sequences.

The RDP web site also offers the possibility to load a query sequence, find the closest
neighbours in terms of 7-mer sharing percentage (by using Sequence Match) and to build
a phylogenetic tree (via Tree Builder). The databases provided by this service are known to be of high quality
and it is possible to restrict them to cultured bacteria, uncultured bacteria, or both, as well
as to TSs, non-TSs, or both. These selections thus correspond to our “lax” (selection
of cultured and non-cultured) or “superstringent” (selection of type-strain sequences
only) databases and to other, intermediate choices. The maximum number of matches
is limited to 20, but it is possible to select more closest taxa by another procedure
(Hierarchy Browser or Sequence Match), and to proceed to their phylogenetic analysis.
Alignment is done by the fast, rRNA secondary-structure-aware Infernal aligner [94], and
the phylogeny is obtained by distance methods such as NJ or Weighbor [95] with bootstrap
support computation. At most 50 sequences can be put in the tree.
A good knowledge of bacterial taxonomy is required to select the most phylogenetically
related neighbours and to select a pertinent outgroup if wanted. Apart from these
requirements, the phylogeny obtained is subject to the intrinsic limitations of distance
reconstruction methods. This website requires numerous selection and data transfer
steps that are not needed in leBIBI QBPP.
In particular, the manual selection of the recruited sequences that will be used for alignment and phylogeny
is not needed in leBIBI QBPP, where this is done by choosing the reference database and the number of retained
matching sequences.

The last similar tool is provided by the Phylogeny.fr web site [96]. This system allows
the user to perform a BLASTN search and then to compute a phylogeny on
a set of homologous sequences. The first main difference from leBIBI QBPP
is that the submission of several sequences is not possible. Also, the database
choice is limited to GenBank. Consequently, ill-identified sequences and large numbers
of nearly identical sequences are often recruited in the resulting phylogenetic trees.
In its simplest protocol, this web tool performs multiple sequence alignment computation
by Muscle, alignment trimming by GBlocks, phylogenetic reconstruction by PhyML and
tree rendering with TreeDyn [97]. Many options allow this process to be customized. The main
differences between the service
provided by Phylogeny.fr and the present tool are that leBIBI QBPP
performs all its analyses in one step from the user’s viewpoint, and that its databases
are optimized for microbial phylogeny.

Case Studies

A short SSU rDNA gene fragment of an unknown bacterium was recently sequenced and
studied in our laboratory. Using the “stringent” database with 50 recruited sequences
led to an unexpected phylogenetic tree with multiple warnings (Additional file 1). The
interpretation was that the tree was unbalanced due to a large number of Mycobacterium
tuberculosis sequences. M. lepromatosis was the closest species in terms of patristic and
node distances. As this species
has not been validly published yet (this is denoted by the t after the name), the
“superstringent” database could not be used. The chosen solution
was to increase the number of recruited sequences to 100. The resulting phylogenetic
tree was considerably improved (Additional file 2), especially through the recruitment of
an outgroup sequence. The M. leprae cluster is phylogenetically positioned close to
M. lepromatosis and M. haemophilum, as expected. The query sequence is that of a new
species of Mycobacterium [98]. This was confirmed by analysis of the rpoB sequence obtained
from the same bacterial extract, using the rpoB “stringent” database (Additional file 3).

A set of 44 partial SSU rDNA sequences (1200–1450 bp) was obtained from bacteria
cultivated from filtered ion-exchanged tap water. The 44 sequences were processed
batch-wise by leBIBI QBPP
with the “TS-stringent” database as reference in a five-minute run (Additional file
4). In most cases, the closest sequence to the query belongs to its proximal cluster.
Therefore, the taxonomic assignment of 41 strains was highly reliable according to
the criteria presented above. In three cases, the query was not clearly inside or
close to a cluster. These three sequences require further expertise as they may belong
to new taxonomic entities, species or genera.