eRFSVM: a hybrid classifier to predict enhancers by integrating random forests with support vector machines

Transcription regulation in human genes is a complex process. Systematic and precise identification of regulatory DNA elements, especially enhancers [17], is a prerequisite for understanding gene expression in both healthy and diseased cells [9]. A growing number of studies indicate that mutations in enhancers are associated with human diseases, such as cancers [7, 15], cardiovascular diseases [21] and immunological diseases [27].

Enhancers increase transcriptional output in cells and show several distinct properties: (a) they are distal regulatory elements, which may be located 20 kb or further from transcription start sites; (b) they can activate gene transcription by recruiting transcription factors (TFs) and their complexes; (c) they may be enriched with chromatin modifications, such as monomethylation of histone H3 lysine 4 (H3K4me1) and acetylation of histone H3 lysine 27 (H3K27ac); (d) they can initiate RNA polymerase II transcription, producing a new class of non-coding RNAs called enhancer RNAs (eRNAs) [1]; (e) they are tissue specific and only weakly conserved, functioning in different locations and at different stages. Validating enhancers experimentally one at a time is therefore time-consuming and costly. Predicting enhancers from conservation analysis of genomic sequences also does not work well [23]. With the development of high-throughput sequencing technologies, advanced computational tools now make this task feasible in the big data era.

Machine learning algorithms [31] have been used to predict enhancers from chromatin immunoprecipitation sequencing (ChIP-Seq) datasets [10, 30], such as chromatin modification loci and TF binding sites (TFBs) [24, 31]. Single classifiers used supervised learning algorithms: CSI-ANN [13] introduced an artificial neural network (ANN) approach; RFECS [25] identified enhancers with random forests (RF); ChromaGenSVM [12] applied support vector machines (SVMs) with a genetic algorithm (GA) to optimize the SVM parameters; EnhancerFinder [11] used multi-kernel SVMs to predict enhancers in the eukaryotic genome; and DEEP [18] combined SVMs with an ANN, comprising the components DEEP-ENCODE and DEEP-FANTOM5.

Some of the classifiers mentioned above, such as CSI-ANN, ChromaGenSVM, RFECS and EnhancerFinder, used ChIP-Seq datasets as features and relied heavily on EP300, a transcriptional coactivator that can activate gene transcription by binding to TFs; they treated EP300 binding sites as enhancers. However, EP300 binding sites are very likely to be enhancer sites, but not all of them are real enhancers [19]. DEEP used the same datasets as these classifiers to predict EP300-based enhancers. It was also the first to use FANTOM5 datasets as training enhancers, which were reconstructed from enhancer RNA (eRNA) datasets using Cap Analysis of Gene Expression (CAGE) [14]. However, DEEP-FANTOM5 used DNA sequence features rather than ChIP-Seq datasets. DEEP used SVMs as base classifiers, each trained on datasets from a single tissue or cell line, and an ANN as the main classifier, which made the final decision by combining the results of the base classifiers. Finding the global optimum of an ANN has been proved to be an NP-hard problem. The commonly used back-propagation algorithm is a heuristic that is easily trapped in local optima [26]. DEEP therefore has two weaknesses: at the algorithm level, its prediction is not guaranteed to be the global optimum; at the dataset level, it did not use effective features for predicting eRNA-based enhancers.

In this study, we built a hybrid classifier, eRFSVM, comprising eRFSVM-ENCODE and eRFSVM-FANTOM5. We used RF for the base classifiers [3, 4] because it is a fast, easily parallelized algorithm that handles unbalanced datasets well. With eRFSVM, the final decision can reach the global optimum, because both the RF and SVM algorithms are polynomial-time (P) problems [29].

During data pre-processing, we reduced the class imbalance with a sub-sampling algorithm combined with the k-means method, weighing the running time of the program against the loss of information in the samples. The base classifiers were trained with the RF algorithm on datasets from single tissues or cell lines, using 60 % of the data for training and the remaining 40 % for testing. From the test results of the base classifiers, we built the main classifier with SVMs. For eRFSVM-ENCODE, we trained on datasets from cell lines such as Gm12878, Hep, H1-hesc and Huvec, and then tested on datasets from the Hela and K562 cell lines, with enhancers identified by binding sites of the transcriptional coactivator EP300 as labels. For eRFSVM-FANTOM5, we trained on datasets from blood, lung, liver and kidney, and then tested on datasets from adipose, with eRNA-based enhancers as labels.
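The text does not spell out how sub-sampling and k-means are combined, so the following is only a minimal sketch of one plausible reading. It assumes the majority class (non-enhancers) is first clustered with k-means and then sampled from each cluster in proportion to the cluster's size, so that the reduced set still covers the whole feature space. The function names `kmeans` and `kmeans_undersample`, the parameter `k=3`, and the two-feature toy data are all hypothetical and are not taken from eRFSVM.

```python
import random
from math import dist


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def kmeans_undersample(negatives, n_keep, k=3, seed=0):
    """Keep roughly n_keep negatives, sampled per cluster by cluster size."""
    rng = random.Random(seed)
    labels = kmeans(negatives, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(negatives, labels) if lab == c]
        # Proportional quota; rounding means the total is only approximate.
        quota = round(n_keep * len(members) / len(negatives))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy data (assumption): 300 negatives around three centers, 30 positives.
rng = random.Random(1)
negatives = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
             for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
positives = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(30)]

kept = kmeans_undersample(negatives, n_keep=len(positives))
print(len(negatives), "->", len(kept), "negatives for",
      len(positives), "positives")
```

Sampling per cluster rather than uniformly at random is one way to limit the information lost when most negatives are discarded, while the fixed iteration count keeps the running time bounded. The balanced set would then feed the 60 %/40 % split used to train and test each RF base classifier.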