Parametric bootstrapping for biological sequence motifs

The computational analysis of DNA, RNA and protein sequences is a cornerstone of bioinformatics, enabling the study of genomes and protein families and providing the scaffold for a broad range of algorithms used in the analysis of biological data [1]. At the molecular level, many biological processes rely on the recognition of specific sequence patterns, or motifs, that define specific interactions between biological molecules [2]. The ubiquity of these motifs has led to the proliferation of a vast array of bioinformatics algorithms dedicated to the discovery and study of these sequence elements and their evolution [2–8].

Transcription factors modulate gene expression by binding to DNA in the promoter region of regulated genes. This binding relies on the specific recognition of short (5–30 bp) DNA sequence motifs by the transcription factor (TF) and, therefore, the discovery and characterization of TF-binding motifs is essential to our understanding of transcriptional gene regulation [5, 9, 10].

The discovery of TF-binding motifs is based on the elucidation of statistically overrepresented sequence elements within a set of sequences known or suspected to harbor TF-binding sites (e.g. promoter regions of co-transcribed genes). Many algorithms for motif discovery have been developed over the years, but they can be broadly divided into word-based and probabilistic approaches [3]. Word-based methods rely on the enumeration of oligonucleotides [11], whereas probabilistic and machine learning approaches use models of varying complexity to represent TF-binding motifs, estimating model parameters through sampling or optimization techniques [2, 12–15]. Central to these approaches is the definition of a robust statistical framework, a TF-binding motif model and its enhancement with heuristics based on knowledge of the underlying biochemistry. Most motif discovery methods, for instance, can enforce symmetry in TF-binding motifs to enhance performance when the TF is known to bind as a homodimer [16]. Similarly, the canonical position-specific weight matrix (PSWM) model for TF-binding motifs, which assumes positional independence in the motif, can be extended to accommodate variable spacing or positional dependencies [17–19]. Determining the proper model for TF-binding motifs also plays a pivotal role in other aspects of their analysis, such as the search for TF-binding sites or the use of simulations to analyze TF-binding motif evolution [7, 8, 20–22].

In principle, many properties observed in experimentally determined collections of TF-binding sites could be used to enhance algorithms involved in the discovery, search and evolutionary simulation of TF-binding motifs through the inclusion of heuristics or the adoption of expanded models. A principled introduction of such enhancements, however, requires that the properties of naturally occurring TF-binding motifs be contrasted with those of random ensembles of motifs matching some of their defining statistics. Indeed, the practice of comparing empirical data to the statistics of random ensembles is common in other fields such as complex network analysis [23] and systems biology [24]. Comparatively little attention, though, has been paid to the problem of defining such ensembles for biological sequence motifs and designing algorithms to sample them efficiently.

As a measure of the optimal mean message length required to encode samples from a probability distribution, information content (IC) serves as a unifying statistic of sequence conservation [25, 26]. Here we propose and characterize two different algorithms to sample from the set of DNA motifs matching a desired value of this most fundamental statistic.
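To make the statistic concrete, the following is a minimal sketch of how the IC of a motif can be computed from a collection of aligned binding sites. It assumes a uniform background over the four bases, applies no small-sample correction or pseudocounts, and sums column-wise IC under the positional independence assumption of the PSWM model. The function names `column_ic` and `motif_ic` and the example sites are illustrative and are not taken from the original work.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def column_ic(column):
    """IC (in bits) of one alignment column, assuming a uniform background.

    Computed as log2(4) minus the Shannon entropy of the observed base
    frequencies; no small-sample correction is applied (an assumption).
    """
    counts = Counter(column)
    n = len(column)
    entropy = 0.0
    for base in ALPHABET:
        p = counts[base] / n
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(ALPHABET)) - entropy


def motif_ic(sites):
    """Total IC of a motif: the sum of column-wise IC over aligned sites."""
    columns = zip(*sites)  # transpose the list of sites into columns
    return sum(column_ic(col) for col in columns)


# Hypothetical toy alignment of four sites of width 4.
sites = ["ACGT", "ACGA", "ACTT", "ACGC"]
per_column = [column_ic(col) for col in zip(*sites)]
total = motif_ic(sites)
```

In this toy alignment the first two columns are perfectly conserved and contribute 2 bits each, whereas the last two columns contribute less, so the total falls short of the 8-bit maximum for a four-column DNA motif.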

We demonstrate their use by analyzing the informational Gini coefficient (IGC) of TF-binding motifs. Assuming that transcription factor binding motifs require a certain amount of information in order to effectively address their regulated genes, it is an open question how this information should be distributed among the positions of the motif.

Researchers have long noted disparity in the degree of conservation between the columns of prokaryotic transcription factor binding motifs [27]. A theoretical rationale for this disparity has been proposed based on the observation of sine wave-like patterns in motifs bound by multimeric transcription factors. Regions bound through direct readout by each TF monomer require a higher degree of conservation than “spacer” regions involved primarily in backbone contacts, leading to wave-like differential patterns of information content in collections of aligned sites [28–30]. The variability in spacing between the monomer binding sites of different TFs (illustrated in Fig. 1) complicates the analysis of such patterns and the evaluation of their statistical significance. IGC measures the degree of departure from uniformity in the distribution of positional information content across a motif, without assuming any particular shape for that distribution. IGC therefore provides a formal and generic statistical framework to analyze deviations from uniformity in the positional distribution of information of biological binding motifs, such as those imposed by multimeric binding.
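As an illustration of the idea, the sketch below computes a Gini coefficient over a list of column-wise IC values. It assumes the standard mean-absolute-difference form of the Gini coefficient; the exact definition and normalization used for IGC are not given in this section, so this should be read as one plausible instance rather than the definitive formula. The function name `gini` and the example values are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (standard form, assumed here).

    G = sum_i sum_j |x_i - x_j| / (2 * n * sum_i x_i)
    Returns 0.0 for an empty list or when all values are zero.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    return abs_diff_sum / (2 * n * total)


# Hypothetical column-wise IC profiles (bits) with the same total IC.
uniform_profile = [1.0, 1.0, 1.0, 1.0]   # information spread evenly
skewed_profile = [0.0, 0.0, 0.0, 4.0]    # information concentrated in one column

igc_uniform = gini(uniform_profile)
igc_skewed = gini(skewed_profile)
```

Both profiles carry the same total IC, but the coefficient is 0 for the even profile and 0.75 for the concentrated one, which is the kind of difference that a shape-agnostic measure of unevenness is meant to capture.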

https://static-content.springer.com/image/art%3A10.1186%2Fs12859-016-1246-8/MediaObjects/12859_2016_1246_Fig1_HTML.gif
Fig. 1 Distribution of IGC values. For prokaryotic and eukaryotic motif collections, the distribution of IGC is approximated by kernel density estimation. For each collection, the minimum, modal and maximum elements are depicted in sequence logos. Prokaryotic motifs: (a) OmpR, (b) LexA, (c) NtaC. Eukaryotic motifs: (d) NFAT, (e) REPO, (f) Macho-1.

Importantly, the distribution of information across a motif is a global property of the motif (rather than a property of its columns or column pairs) and, therefore, it cannot be analyzed via column-wise methods. Hence, the use of random ensembles constitutes, to our knowledge, the only means of rigorously assessing the distribution of information content in TF-binding motifs. Our results show that the degree of disparity in information content across positions, as measured by IGC, is significantly higher in transcription factor binding motifs than in null ensembles with matched IC, and that higher IGC is not consistently associated with motifs bound by multimeric TFs. This indicates that the greater unevenness in the distribution of information observed in biological motifs is an intrinsic property of TF-binding motifs, suggesting that IGC could be exploited as a signature of biological authenticity in applications such as motif discovery.
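The comparison of an observed statistic against a null ensemble can be sketched as a one-sided empirical p-value. The snippet below assumes that IGC values for an ensemble of random motifs with matched IC have already been obtained by some sampling procedure; it does not implement the sampling algorithms themselves. The function name `empirical_p_value`, the add-one correction and the numeric values are illustrative assumptions, not results from the original work.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical p-value for observed >= null.

    Uses the common add-one correction, (1 + k) / (1 + N), where k is the
    number of null values at least as large as the observed statistic and
    N is the size of the null ensemble.
    """
    k = sum(1 for v in null_values if v >= observed)
    return (1 + k) / (1 + len(null_values))


# Placeholder numbers for illustration only (not real data).
observed_igc = 0.5
null_igcs = [0.1, 0.2, 0.6, 0.7]

p = empirical_p_value(observed_igc, null_igcs)
```

With these placeholder values, two of the four null motifs have an IGC at least as large as the observed one, giving p = 3/5. In practice the null ensemble would need to be far larger for the estimate to be useful.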