Mirinho: An efficient and general plant and animal pre-miRNA predictor for genomic and deep sequencing data

May 28, 2015

Dataset

To explore a variety of data types from several species, we used six datasets. The
sRNAseq data and the genomes were obtained from the NCBI. The annotations concerning
the known (pre-)miRNAs were obtained from MIRBASE release 20 6]. The datasets are organised as follows:

D1: the chromosomes of six metazoan species:

Chromosome 25 from Bos taurus (27 miRNAs)

Chromosome I from Caenorhabditis briggsae (14 miRNAs)

Chromosome 2R from Drosophila simulans (36 miRNAs)

Chromosome 25 from Gallus gallus (6 miRNAs)

Chromosome 22 from Gorilla gorilla (8 miRNAs)

Chromosome 19 from Mus musculus (60 miRNAs)

D2: the genomic sequence of length 5,000nt extracted from the human chromosome 1
and taken from the dataset used by 7].

D3: the artificial dataset compiled by 8]. This dataset is composed of 168 pseudo pre-miRNAs and 163 true human pre-miRNAs.
The pseudo pre-miRNAs are sequences that form a hairpin structure; however, they are
not functional because they are located in coding sequence regions (CDS). This dataset
is available at http://bioinfo.au.tsinghua.edu.cn/mirnasvm/.

D4: an artificial dataset compiled by us. For each of the three chromosomes listed
below:

Chromosome III of Caenorhabditis elegans (44 miRNAs)

Chromosome 2R of Drosophila melanogaster (92 miRNAs)

Chromosome 19 of Homo sapiens (234 miRNAs)

10 miRNAs were randomly chosen together with 100nt both up and downstream. Each fragment
(miRNA + extension) was flanked by sequences of the same length, which were generated
based on the nucleotide distribution of the given chromosome. In the end, we obtained
three different sequences of ? 4,265nt that were given as input to CSHMM, MIRENA, MIRINHO, and miRPara (see Section â€œCompared methodsâ€).

D5: genomic data of three insects that are of special interest for our group:

Acyrthosiphon pisum genome assembly version 2 (123 miRNAs)

Culex quinquefasciatus genome assembly version 1 (120 miRNAs)

Heliconius melpomene genome assembly version 1.1 (101 miRNAs)

D6: genomic and sRNAseq data of plants. We used the sequence of chromosome 4 of Arabidopsis thaliana (version 2.0) and the whole genome of Arabidopsis lyrata (version 1.0), as well as the high-throughput small RNA sequencing data from the
same species with, respectively, NCBI GEO accession number GPL3968 and GSE18077/GSE20442.
The chromosome 4 from A. thaliana and the genome of A. lyrata contain, respectively, 57 miRNAs and 384 miRNAs.

Pre-processing input sequences with small RNA sequencing data

When sRNAseq data are available, they may be used to pre-process the input sequence,
focusing the search only on transcribed regions. We implemented this functionality
in a separate module, called PROCIN, that is optionally performed before the core algorithm for the prediction of pre-miRNAs.

First, the small RNA reads are mapped against the genome using BOWTIE 9]. The genome sequence and the respective mapping file (extension.sam) produced by
BOWTIE are then given to PROCIN. Using the SAM file, the regions with at least one read are extracted from the genome
sequence. If the region is smaller than 70nt (approximate length of a pre-miRNA),
we extend it by 70nt up and downstream to guarantee that the whole pre-miRNA sequence
will be covered. Both values â€“ number of reads covering the region and length of this
region â€“ may be given as parameters to this module.

For each (extended) pre-miRNA sequence, the lengths of the stem-arm and of the terminal
loop are set as follows (see Figure 2 for an example). The highest read stack is identified, either on the 5â€™ or on the
3â€™ arm. If this stack is on the 5â€™-arm, the end position i of the right-most aligned read is considered and the length of the stem-arm is set
to l=i+1. If the stack is on the 3â€™-arm, p is the length of the pre-miRNA, and j is the start position of the left-most aligned read, the length of the stem-arm is
set to l=p?j+1. The stem-arm is then â€œmirroredâ€ to the opposite stem-arm and the distance between
the two arms is the maximum length of the terminal loop. One should notice that reads
aligning to the terminal loop could mislead this procedure. We solved this problem
when establishing the lengths by disregarding all the reads that aligned to the middle
coordinate of the putative pre-miRNA, in an attempt to focus only on the reads aligning
to the arms.

Figure 2. Using sRNAseq data to set the parameters of MIRINHO. The solid lines are the reads that align to the pre-miRNA sequence represented by
the dotted line (the dots are the coordinates of the pre-miRNA sequence). The point
in red is the middle coordinate of the pre-miRNA that serves to eliminate reads that
align to the terminal loop and may mislead the procedure of determining the parameters.
l is the length of the stem-arm and t is the maximum length of the terminal loop.

One should notice that other molecules, such as rRNAs and tRNAs that may have a read
stack similar to the one of a miRNA, can be recovered by MIRINHO if they fold themselves into stable stem-loops. Since rRNAs and tRNAs are molecules
that are usually well annotated, a simple solution is to mask these regions before
running MIRINHO, for example with MASKFASTA from the BEDTOOLS package 10].

Screening the genome to identify potential pre-miRNAs

The algorithm of MIRINHO begins with a pre-treatment of the data in order to identify the input for the alignment
step. This is done without loss of information at this step, meaning that all the
positions in the genome are verified to determine whether they may contain a pre-miRNA.
This is done by screening the genome as follows.

As one may see in Figure 1, we set a sliding window of length l=25, which represents the length of the stem. For each stem-arm represented by st₁=[w_i,â€¦,w_i+l?1], we look for its putative stem-arm pair st₂=[w_i+l+n?1,â€¦,w_i+2?l+n?1]n nucleotides away from the first one, with n between 5 and 20. In order to set the length of the stem, we consider the mean length
of a miRNA (i.e., 22nt), and to ensure that the whole miRNA is considered in the alignment,
an offset of 3nt, equivalent to the standard deviation (i.e., ?=3.33) of the lengths of the miRNAs in MIRBASE (release 20), was added. Using this strategy, we guarantee that all the miRNAs that
are smaller than 25nt will be considered. The lengths of the terminal loop are set
on the basis of the distribution of the loop lengths for metazoan pre-miRNAs (MIRBASE release 20) as one may see in Additional file 1: Figure S11. Each pair (l, n) will be an input for the alignment algorithm.

Assessing the potential pre-miRNAs

Each pair of putative stem-arms screened in the step described in the previous section
is given to an alignment algorithm in order to evaluate whether it can form a stable
stem-loop structure. For that, we implemented the Needleman Wunsch global alignment
algorithm 11] with a scoring strategy based on the Nearest-Neighbour energy model. Instead of using
the sum of the integer penalties for gaps, matches and mismatches, the alignment is
assessed according to the free energy of each two consecutive nucleotides in the alignment.
This is explained in more detail next.

Nearest-neighbour model

To model the free energy change for the folding of these RNAs, one can use the thermodynamic
Nearest-Neighbour (NN) energy associated to each type of motif in the structure. By
summing up the energy increment of each motif, it is possible to obtain a reasonable
approximation of the free energy change for folding an RNA or, in other words, to
obtain a measure of the stability of an RNA molecule 12],13].

The motifs forming an RNA structure are determined by the base-pairs AU, GC and GU.
The arrangement of these base pairs can shape into the different types of motifs,
such as helices, bulge loops, and internal loops. The stabilising motifs are: the
Watson-Crick helix represented by the stacking of at least two base pairs; and a dangling
end which is a single base at the end of a helix. The destabilising motifs are of
three types: the hairpin loop which is composed of non-canonical base pairs closed
by one canonical base pair; the bulge loop which is an arrangement of unpaired nucleotides
in one of the strands of a helix; and finally, the internal loop which includes unpaired
nucleotides in both strands of a helix. There exist three more types of destabilising
motifs: exterior loop, pseudoknot, and multibranch. The first two are not present
in a pre-miRNA stem-loop structure and will therefore not be explored in detail here
13],14]. The third one (multibranch loops) may occur in the terminal loop of longer hairpin
sequences (such as plant pre-miRNAs). Pre-miRNAs with such a kind of motif will be
recognised if the hairpin is stable enough. However, it will not be considered in
the computation of the free energy because our method only uses the stem-arms to calculate
such energy and the terminal loop to separate these two arms.

As mentioned before, to compute the free energy of an RNA structure, it is necessary
to sum the increments according to the type of the motif. The equations presented
hereafter describe how to compute the free energy associated to each such type.

The energy of a dangling end depends only on the base-pair before the dangling nucleotide
and on this latter. For all the other types of motifs, the equations are given below.
The energy of an internal loop is computed by means of Equation 1:

$?GInternal=?Gi(n)+(?Ga?|n1?n2|)+?Gm1+?Gm2+(?Gru??)$ (1)

where ?G_i(n) is the initiation energy to form an internal loop of n?30 unpaired nucleotides; ?G_a=0.6 is the asymmetry penalty multiplied by the absolute value of the difference between
the number of unpaired nucleotides in each strand; ?G_m1 and ?G_m2 are the energy of the first and the last mismatches in the internal loop; and ?G_ru=0.7 is the penalty for an RU closure, where R={A,G} and ? is the lambda function which returns 0 or 1, corresponding, respectively, to the
presence or absence of, in this case, the RU closure.

For the bulge loops, one should use Equation 2:

$?GBulge(n=1)=?Gi(1)+?GC+?Gs?RTln(t)+(?Gru??)?GBulge(n1)=?Gi(n)$ (2)

where ?G_i(n) is the energy required to form a bulge with n?30 unpaired nucleotides; if the bulge is comprised of the nucleotide C only, and
there is at least one more C not in the bulge (meaning, it is paired with a G), one
should add the C bulge penalty ?G_C=?0.9 kcal/mol; ?G_s is the base pair stacking around the bulge; t is the number of possible loop conformations with identical sequence; R=8.3144621 J/mol K is the gas constant and T=310.15 K is the temperature in kelvin. Notice that for bulges and helices, ?G_ru=0.45 and is referred to as the penalty for an RU end (and not closure as for internal
loops).

For bulges and internal loops larger than 30 nucleotides (n30), Equation 3 should be applied instead:

$?Gn30=?Gi(30)+1.75Ã—RTÃ—ln(n/30)$ (3)

Finally, for a helix, one should apply Equation 4:

$?Ghelix=??Gstck+?Gsym+(?Gru??)$ (4)

where ?G_stck is the stacking energy of each two consecutive base pairs; ?G_sym is the symmetry correction for self complementary duplexes; and ?G_ru=0.45 is, as mentioned before, the RU end penalty.

All the thermodynamic NN energies used in this work, as well as the equations described
above, were obtained in the NEAREST NEIGHBOR DATABASE (NNDB) 14],15].

Algorithm

We implemented a global alignment algorithm (see the Section â€œNeedleman-Wunsch algorithmâ€
below) and an alignment assessment approach based on the NN energy model to measure
the stability of pre-miRNA candidates, that is explained below.

We define the alphabet ?={M_xy,S_xy,I_xy,D_xy}, where the symbols correspond, respectively, to Match, Mismatch, Insertion and Deletion, and x,y?{A,U,C,G,?}. The definition of an alignment of two putative stem-arms, st₁ and st₂, is a vector comprised by the symbols in ?, such that align(st₁,st₂)=v and v= [v_i,v_i+1,â€¦,v_n], where v_i??.

To determine the stability of a pre-miRNA stem-loop, we go through vector v and sum up the free energy of each pair (v_i,v_i+1) according to the type t of the motif in which it is inserted. For that, we use Equation 5 below to compute the energy of each motif in the structure:

$?(t)=k(t,n)+?in?1e(vi,vi+1)$ (5)

where t is the motif type that can be an internal loop, a bulge loop or a helix. The value
k(t,n) accounts for penalties associated to the motif t, which appears n times in the structure. For example, for a motif of type t=helix, one should consider the symmetry correction for self-complementary duplexes ?G_sym (see Equation 4). Finally, the function e returns the energy associated to the pair (v_i,v_i+1).

We then sum all the energies related to the different types of motifs to obtain the
total free energy E of the structure using Equation 6:

$E=??(t)$ (6)

where t is again the different types of motifs a given structure can have.

Needleman-Wunsch algorithm

Since the aligned sequences are similar in length, we implemented the DP global alignment
algorithm described by Needleman and Wunsch, which will try to align every nucleotide
in the sequences. The recurrence for this algorithm is presented in Equation 7:

$W(i,j)=maxW(i?1,j?1)+f(si,sj)W(i,j?1)+?W(i?1,j)+?$ (7)

where f(s_i,s_j) is the function returning the score or penalty for, respectively, a match or a mismatch,
and ? is the penalty for a gap. Using this recurrence one should take, in the worst case,

O(n2)

time to align two sequences of length n11].

Considering that a stable hairpin structure should not contain very large bulges nor
internal loops, an ideal alignment should be concentrated around the main diagonal
of the dynamic programming (DP) matrix. Instead of using the whole matrix, the user
can therefore constrain the alignment to this diagonal and prune parts of the bottom-left
and top-right corners of the matrix, thus saving time in the computation of the free
energies with a small loss.

A parameter dw (diagonal width) is established that depends on the length of the aligned sequences
and on a compromise between sensitivity and precision in relation to the version that
uses the full matrix (see the Section â€œComputing timeâ€ to determine how to set an
appropriate value for this parameter). This parameter is associated to the number
of consecutive gaps (i.e., the length of the bulges) the alignment may have.

RNA secondary structure prediction algorithm

To facilitate the comparison of the complexity of each algorithm, we present here
a description of the method for predicting an RNA secondary structure. The first such
method is the one that was described by Nussinov 16]. The algorithm proposes a maximisation of the number of base pairs to find the best
structure. For each position i in the sequence, one should verify all the possible cases: (a) i,j base pair; (b) i is unpaired; (c) j is unpaired; (d) i,j base pair with, respectively, k and k+1. The recurrence for this algorithm is presented in Equation 817]:

$E(i,j)=maxE(i+1,j?1)if i and j base pairE(i+1,j)E(i,j?1)maxikj[E(i,k)+E(k+1,j)]$ (8)

Clearly, filling each cell in the DP matrix takes
O(n)

time, and since there are
O(n2) cells, the complexity for the whole procedure is in
O(n3). However, maximising the number of base pairs is a naÃ¯ve approach; a more realistic
one is to minimise the free energy of the structure, for example as proposed by Mathews
12]. The recurrence for the latter algorithm is presented in Equation 9:

$E(i,j)=minE(i+1,j)E(i,j?1)minikj[E(i,k)+E(k+1,j)]P(i,j)if i and j base pair$ (9)

To minimise the free energy, one more table P is required to store the different types of motifs a structure can have, although
the complexity in the worst case remains the same, namely in
O(n3)

12],13].

Setting the energy threshold

To determine an appropriate energy threshold for the prediction of pre-miRNAs, three
approaches were tested. The first two were ROC (Receiver Operating Characteristic)
curves, one using the insect genomes from dataset D5 (Additional file 1: Figure S3) and the other the pre-miRNA dataset D3 (see Additional file 1: Figure S4). In the third approach, we used a set of random genomes generated according
to a multinomial distribution and compared the pre-miRNAs predicted in such genomes
against those predicted in the original metazoan genomes from dataset D1. The reasoning
behind this strategy is that, if the energy model is robust enough, there should exist
an energy threshold that is able to differentiate the stable hairpin structures from
the randomly generated ones in which the base pairs would be established by chance.

The nucleotide frequency distribution in the original metazoan genomes was used to
generate the respective random genomes. After that, the prediction of the pre-miRNAs
was performed on both versions (original and random) of each genome. We then chose
as threshold the biggest energy for which the number of true miRNAs remains zero in
the random genome, as can be seen in Figure 3 and Additional file 1: Figures S5, S6, S7, S8 and S9. To define a â€œtrueâ€ miRNA in the random genome, we
considered only if its coordinates were the same as those of a true miRNA in the original
genome from which the random one was created. Using this approach, the selected genomes
had the thresholds presented in Table 1.

Figure 3. Setting energy threshold. Number of TP pre-miRNAs predicted when using the original
and the random genomes of M. musculus. The vertical line represents the energy (?19 kcal/mol) that better distinguishes
true from false pre-miRNAs.

Table 1. Computing time comparison

In the first approach, the best operating point was providing an acceptable sensitivity
and a higher specificity, which was being overestimated due to the inability to define
a true negative pre-miRNA at a genomic scale (see Section â€œD5: insect genomesâ€). Moreover,
the threshold (e=?16.3) established with this approach was leading to a very weak precision. We then
proceeded to use the second and third approaches which produced more reasonable results,
meaning having good sensitivity and precision. The thresholds set with the latter
approaches are very similar, e=?21 kcal/mol for the second and e=?20.6 kcal/mol for the third. The default energy threshold is set to ?20.6 kcal/mol,
that is the mean of the energies mentioned in Table 1.

Once the secondary structure is computed by MIRINHO, we may verify if the reads aligned on the locus mimic the constraints consistent
with Dicer processing by using an a posteriori script called CHECKDICER. We check the position of the reads within the hairpin (on the mature, star or terminal
loop), allowing facultative read overlaps between these three regions. Such overlaps
may cause false positives. At the same time, directly discarding the candidates with
overlapping reads may also result in false negatives. To thus allow a flexible tradeoff
between FP and FN, the user may provide two parameters to CHECKDICER: (i) the percentage of the overlap for each read; and (ii) the percentage of the
number of reads overlapping. The user may thus choose to be strict with Dicer processing
by setting the first parameter to zero (risk of FN); or to allow small percentages
for one or both values (risk of FP). It is important to specify that the Dicer processing
validity check implemented in CHECKDICER has not been used in any of the analyses that were performed.

Comparison with other methods

Compared methods

To compare the accuracy of our method with the one of other predictors, we first made
an extensive search of the available ones (see Additional file 1: Table S1). We put aside the predictors that required other kinds of input files
than just the fasta sequence, as well as those incompatible with the Unix system.
Web-servers were also disregarded because there always is a restriction to the length
of the sequence that may be input. The methods that remained were CSHMM, MIRENA, and MIRPARA. Notice that as one of our main contributions is the efficiency in the prediction
of pre-miRNAs in relation to other methods that use cubic complexity algorithms, it
was natural to compare MIRINHO to methods that adopt this kind of algorithm. However, we also included in the comparison
a method such as CSHMM which does not use a cubic algorithm for the prediction of miRNAs.

Since the set of input parameters differs for each method, it is not a trivial task
to set them in accordance with the data and at the same time be fair in the comparison.
We thus decided to apply the methods with default parameters. However, we adapted
one aspect that was common to all the methods: the set of known (pre-)miRNAs. All
the methods were trained, when required, with animal (pre-)miRNA sequences. A description
of each method, and how each was set up is given below.

CSHMM uses a Context-Sensitive Hidden Markov Model to predict pre-miRNAs 7]. To set the initial parameters for CSHMM, we used the secondary structures of the kingdom metazoan that are available in MIRBASE release 20. To generate the likelihood score, the metazoan hairpin sequences were
used as positive training set, and as negative instances the sequences employed by
the authors.

MIRENA applies a series of consecutive filters to determine a pre-miRNA and does not require
a training step 18]. It provides different starting points for the prediction depending on the type of
file given as input. We set the parameter to allow for a genomic input (-M option).
The set of known mature miRNAs required was from the same metazoan kingdom, taken
from MIRBASE release 20.

MIRPARA (version 6.2) is an SVM implementation trained with sequences from MIRBASE 19]. It makes available a script to generate the model according to the MIRBASE release and to the desired organism(s)/clade. In our case, we chose the model trained
with the metazoan pre-miRNAs of MIRBASE release 20. It is worth observing that among the compared methods, MIRPARA is the only one that also predicts the position of the miRNA within the pre-miRNA.

To compare the performance of our method when using sRNAseq data, we used MIRDEEP which is a method for the discovery of miRNAs from deep sequencing data 20]. To obtain potential pre-miRNAs, the authors use information of the mapped reads
against the genome. Pre-miRNAs with an unlikely structure are discarded and the core
algorithm computes a probabilistic score related to the structure and to the signature
of the pre-miRNA candidate. It is worth reminding that the core algorithm of MIRINHO only requires a fasta sequence; sRNAseq data, when available, are used to pre-process
the input sequence by considering only the mapped regions, as is described in Section
â€œPre-processing input sequences with small RNA sequencing dataâ€.

To analyse the quality of the predicted structures, we used RNAFOLD 21] and MIRNAFOLD 22]. The first is a classical method for predicting an RNA secondary structure through
energy minimisation. The second is a method for predicting a hairpin structure that
takes into account specific criteria (such as the length of the stem, the percentage
of nucleotides, the size of the terminal loops) related to known hairpins from MIRBASE, and verifies if these are present in the query structure. MIRNAFOLD is moreover, as far as we know, the only other method that has quadratic complexity
for predicting a pre-miRNA structure. For more details on how each of the methods
were used, see the next section.

Pre-miRNA hairpin structure in plants

To evaluate the performance of each method in the prediction of pre-miRNAs, we used
as measures sensitivity, precision and specificity. The first is the proportion of
true pre-miRNAs that are correctly predicted, the second is the fraction of predicted
pre-miRNA candidates that are real pre-miRNAs, and the third is the proportion of
false pre-miRNAs that are correctly excluded:

$Sensitivity=TPTP+FN$ (10)

$Precision=TPTP+FP$ (11)

$Specificity=TNTN+FP$ (12)

where TP stands for True Positive, TN for True Negative, FP for False Positive, and
FN for False Negative.

To compute the number of true pre-miRNAs predicted by each method, we did the following.
For a given species, there is a control set C={c_j,c_j+1,â€¦} of miRNAs that are considered to be true miRNAs according to MIRBASE, where j?{1..n} and n is the number of true miRNAs for a given species. For each predicted pre-miRNA, denoted
by ppm, we verified whether one of its arms fully covers a control miRNA, denoted by cm_j. If that was the case, we accounted for one TP. If the same ppm covered more than one control miRNA, we considered just the one with the best prediction
score according to each method. It is worth observing that a large proportion of the
miRNA sequences in MIRBASE are not supported by experimental evidence. However, it is considered as a reference
database.