Does conservation account for splicing patterns?

Alternative splicing, the production of multiple mRNA isoforms from a single gene, is critical to the generation of biological complexity and the differentiation of both tissues and species [1]. Consequently, there has been great interest in recent years in developing in silico models of the splicing code – the interactions of cis and trans regulatory elements – from simpler biological features such as genetic sequence, nucleosome positions and RNA secondary structure [2, 3]. Ideally, a splicing model should be able to make several types of predictions: the ‘absolute’ percent-spliced-in (Ψ) of any exon in various tissues, ΔΨ between tissues, the impact of mutations on Ψ [4], and binding sites for RNA-binding proteins (RBPs) that affect splicing [5]. Notably, none of these goals requires the model to actually mimic the inner workings of the cell, and most metrics used to evaluate the quality of a model’s predictions do not take into account its biophysical fidelity.
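
As a concrete illustration of the quantities named above, the following minimal sketch estimates percent-spliced-in for one exon from read counts and takes the difference between two tissues. The read-count ratio is a common simplification and is an assumption here, not a definition given in this article; the function names, tissue names and counts are hypothetical.

```python
from typing import Dict, Tuple


def psi(inclusion_reads: float, exclusion_reads: float) -> float:
    """Fraction of reads supporting exon inclusion.

    Assumption: a plain read-count ratio, with no correction for
    isoform length or mappability.
    """
    total = inclusion_reads + exclusion_reads
    if total <= 0:
        raise ValueError("no reads cover this exon")
    return inclusion_reads / total


def delta_psi(psi_a: float, psi_b: float) -> float:
    """Signed difference in percent-spliced-in between tissue A and tissue B."""
    return psi_a - psi_b


# Hypothetical (inclusion, exclusion) read counts for one exon.
counts: Dict[str, Tuple[int, int]] = {"brain": (90, 10), "liver": (20, 80)}
psi_by_tissue = {tissue: psi(inc, exc) for tissue, (inc, exc) in counts.items()}
brain_vs_liver = delta_psi(psi_by_tissue["brain"], psi_by_tissue["liver"])
```

With these made-up counts the exon is mostly included in brain and mostly skipped in liver, so the difference between the two tissues is large and positive.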

It has long been known that alternative splicing is associated with modified evolutionary conservation of both exons [6] and their flanking introns [7]. Modrek and Lee [6] found that newly created exons (those with non-conserved splice junctions) have low Ψ and hypothesized that this served a useful evolutionary purpose, by allowing the exon to accumulate beneficial mutations without the organism losing the benefits of the original protein in the meantime. Sorek and Ast [7] noted that alternatively spliced exons are disproportionately likely to have conserved flanking introns, and identified that one abundant k-mer in conserved downstream introns had known cis-regulatory properties.

The role of intronic conservation extends to tissue-specific splicing regulation as well. Sugnet et al. [8] found that exons with high ΔΨ between brain or muscle and other tissues tended to have highly conserved flanking introns. Yeo et al. [9] discovered that conserved Fox and Nova motifs in introns are associated with higher Ψ in brain tissue. Wang et al. [10] found that exons with ‘switch-like’ ΔΨ > 0.5 between any pair of tissues have increased conservation in flanking introns.

Computational models of splicing often depend on conservation for their accuracy. A previous study on alternative splicing modelling [2] found that a metric of model quality increased by one-third when conservation was incorporated into the model. The most accurate existing models of alternative splicing [4, 11, 12] also rely heavily on conservation. These models train neural networks on over 1000 ‘hand-crafted’ features, including motif counts, position weight matrix (PWM) correspondences, sequence lengths, RNA secondary structure, nucleosome positions, and translatability and frameshift features. In these models, conservation is used both in raw form, as averages over the first 100 bp of each flanking intron (average conservation), and to weight motif counts. The underlying assumption is that conservation is mostly useful to indicate the overall level of cis elements in flanking introns (average conservation) and to determine which occurrences of interesting motifs are actually relevant for splicing (conservation-weighted motif counts).
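
The two uses of conservation described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the feature code of the cited models: the function names are hypothetical, per-base conservation scores are assumed to lie in [0, 1], each intron is assumed to be oriented so that index 0 is the base adjacent to the exon, and each motif occurrence is weighted by the mean conservation of the bases it covers.

```python
from typing import Sequence


def average_conservation(scores: Sequence[float], window: int = 100) -> float:
    """Mean per-base conservation over the first `window` intronic positions."""
    region = scores[:window]
    if len(region) == 0:
        return 0.0
    return sum(region) / len(region)


def weighted_motif_count(seq: str, scores: Sequence[float], motif: str) -> float:
    """Count motif occurrences, each weighted by its mean conservation.

    Assumption: an unconserved occurrence contributes 0 and a fully
    conserved occurrence contributes 1; overlapping matches are all counted.
    """
    if len(seq) != len(scores):
        raise ValueError("sequence and conservation track differ in length")
    k = len(motif)
    if k == 0:
        raise ValueError("motif must be non-empty")
    total = 0.0
    for start in range(len(seq) - k + 1):
        if seq[start:start + k] == motif:
            total += sum(scores[start:start + k]) / k
    return total


# Toy intron: two copies of the Fox-like hexamer TGCATG, the first fully
# conserved and the second half conserved (made-up scores).
intron = "TGCATGAAAATGCATG"
track = [1.0] * 6 + [0.0] * 4 + [0.5] * 6

avg = average_conservation(intron and track)        # mean over available bases
weighted = weighted_motif_count(intron, track, "TGCATG")
```

In this toy case a raw motif count would be 2, whereas the conservation-weighted count is lower because the second occurrence is only partly conserved. This is the sense in which weighting is meant to separate motif occurrences that are likely relevant for splicing from incidental matches.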

This article introduces several computational models of splicing that depend on conservation, with the goal of understanding the evolution of alternative splicing. Some previous studies of alternative splicing and conservation [13–15] analyze the conservation of alternative splicing patterns between species. Instead, we prefer to focus on the conservation of the sequence near alternative splice sites, as this incorporates flanking introns into the analysis and provides more fine-grained insights into the differing roles of conservation in various regions of the sequence.