RefSelect: a reference sequence selection algorithm for planted (l, d) motif search

Motif discovery, as a main means to locate conserved fragments in biological sequences, is a fundamental problem in computational biology. The conserved fragments usually have special biological significance. For example, transcription factor binding sites in DNA sequences [1, 2] play a key role in gene expression regulation and they usually range from 5 to 25 base pairs; short protein sequence signatures [3, 4], which usually range from 10 to 36 residues, can be used in identifying potential interaction sites of proteins.

The planted (l, d) motif search (PMS) [5] is a famous formulation for motif discovery: given a data set D?=?{s
1, s
2, …, s
t
} with t n-length sequences over an alphabet ?, q satisfying 0??q???t, and l and d satisfying 0???d??l??n, the goal is to find one or more l-length strings m such that m occurs in at least q sequences in D with up to d mismatches. The string m is called a (l, d) motif, and each occurrence of m is called a motif instance. Finding all (l, d) motifs present in the input sequences is NP-complete [6].

In the PMS problem, the value of q implies the corresponding sequence model of motif discovery (i.e., the distribution of motif occurrences in the input sequences). The usual sequence models include OOPS, ZOOPS and TCM [7], representing that each input sequence contains one occurrence, zero or one occurrence and zero or more occurrences, respectively. When q?=?t and 0??q??t, the PMS problem corresponds to the OOPS and ZOOPS or TCM sequence models, respectively.

There have been numerous motif discovery algorithms [8, 9]. They are either approximate or exact, based on whether the algorithm can always find all motifs or the optimum motif. The approximate algorithms usually adopt probability analysis and statistical methods. For instance, the two classical algorithms MEME [10] and Gibbs Sampling [11] identify motifs by using expectation maximization and Gibbs sampling techniques, respectively. In general, these approximate algorithms can solve the problem in a short time but cannot guarantee global optimum.

In this paper, we mainly focus on exact motif discovery algorithms, which can find all (l, d) motifs by traversing the whole search space. The main indicator to assess exact algorithms is time performance, and researchers usually compare exact algorithms on the challenging PMS problem instances, for which the expected number of random (l, d) motifs present in the sequences is more than 1 [12]. For the exact algorithms proposed in earlier years, such as WINNOWER [5], DPCFG [13] and RecMotif [14], their search space is composed of (nl?+?1)t
possible alignments of motif instances. In recent years, the exact algorithms verify all the l-length patterns in the O(|?|l
) search space, and then output the patterns with motif property; we call them the pattern-driven PMS algorithms [1524].

The pattern-driven PMS algorithms have better time performance than other exact algorithms so far in identifying both short motifs and long motifs with weak signal. Their basic idea is to generate candidate motifs by using several reference sequences in the input, and then verify each candidate motif one by one. Specifically, they generate candidate motifs by using all possible h-tuple T?=?(x
1, x
2x
h
) composed of h l-length strings coming from h distinct reference sequences. In existing pattern-driven PMS algorithms, h is 1 for PMSP [15] and PMSPrune [16]; h is 2 for StemFinder [17], PairMotif [18], qPMS7 [19] and TravStrR [20]; h is 3 for iTriplet [21] and PMS5 [22]; for PMS8 [23] and qPMS9 [24], h is greater than or equal to 3 and self-adaptive in dealing with different PMS problem instances. Moreover, these algorithms use k?=?tq?+?h reference sequences to generate candidate motifs, ensuring that there exists at least one h-tuple T so that each l-length string in T is a motif instance.

Although pattern-driven PMS algorithms outperform other exact algorithms, most of them use the first k sequences in the input as reference sequences, without considering the effect of different reference sequences on time performance. In this study we found that given a data set, different reference sequences may lead to quite different number of candidate motifs, especially for large alphabets. So, in dealing with different inputs with the same scale, the pattern-driven PMS algorithms may exhibit sharp fluctuations in running time. For instance, we randomly generate multiple groups of data sets with |?|?=?20, t?=?20, q?=?20 and n?=?600 following the method described in the results and discussion section. When solving the (19, 9) problem instance, qPMS7 sometimes consumes 6.1 minutes, but sometimes over 48 hours. Some other pattern-driven PMS algorithms like TravStrR and PMS8, suffer from the same problem (see Supplement 1 for more examples).

To solve this problem, we propose a method named RefSelect to quickly select reference sequences that generate a small number of candidate motifs. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms, without doing any modifications to them.