Meta-analytic support vector machine for integrating multiple omics data

Over the last decade, microarray and massively parallel sequencing technologies have generated multiple omics data sources from large cohorts at an unprecedented rate. Moreover, as experimental costs have dropped, a huge number of data sets have accumulated in public data repositories (e.g., Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA)). Yet low reproducibility has remained a chronic concern, owing to the small-to-moderate sample size of each individual study (e.g., 40–100) and the low signal-to-noise ratios of genomic expression data [24, 26, 27]. In an effort to tackle these challenges, effective data integration methods have drawn wide attention in biomedical research [2]. Traditional meta-analysis integrates significance levels or effect sizes of similar data sets (similar design or biological hypothesis), and has proven to be effective in discovering significant biomarkers [14, 37]. Such multi-study data integration, which combines multiple homogeneous omics data sets, is also known as “horizontal meta-analysis” [38]. In addition, many large consortia such as The Cancer Genome Atlas (TCGA) and the Lung Genomics Research Consortium (LGRC) have generated different types of omics data (e.g., mRNA, methylation, CNV and so on) using samples from a single cohort. These data sets are aligned vertically by samples, and thus the integration of such multi-omics data is called “vertical omics integrative analysis” [38]. By jointly leveraging multiple layers of omics data, vertical omics integration facilitates deciphering biological processes, capturing the interplay of multi-level genomic features, and elucidating how a priori biological knowledge (e.g., pathway databases) functions within the framework of systems biology.

High-throughput microarray and sequencing data have been extensively applied to monitor biomarkers and biological processes related to many diseases [4], and to predict complex diseases (e.g., cancer diagnosis [36]), prognosis [45], and therapeutic outcomes [23]. In particular, recent classification and prediction tools have notably advanced translational and clinical applications (e.g., MammaPrint [43], Oncotype DX [30] and Breast Cancer Index (BCI) [49]). In line with this trend, the support vector machine (SVM) has also been widely applied in genomic studies and has proved to be one of the most powerful prediction methods [3, 15, 29], which is attributed to the flexibility of its non-linear decision boundary. Commonly, gene selection (a.k.a. feature reduction) with respect to outcomes reduces the dimension of expression data, which shortens training time and enhances interpretability. In addition, gene selection removes a large number of irrelevant genes that potentially undermine precise prediction, and notably the idea of feature selection using SVMs can extend to the setting of multi-omics data analysis [18, 25]. In this regard, many researchers have made tremendous efforts to overcome the low accuracy of SVMs when analyzing high-dimensional genomic data. For instance, Brown et al. [5] introduced a functional gene classification that uses various similarity functions (e.g., kernels modeling prior knowledge of genes). Moreover, because the SVM relies on the small subset of samples that differentiate between class labels and excludes the remaining samples, it is believed to have the potential to handle large feature spaces and the ability to identify outliers. Guyon et al. [9] also proposed a gene selection method based on the SVM with Recursive Feature Elimination (RFE), which recursively removes insignificant features to increase classification performance.
Despite the SVM’s strengths in many applications, current SVMs focus only on the analysis of a single data set, and so inevitably run into the problem of low reproducibility. To address this problem, we propose a meta-analytic framework based on the support vector machine (Meta-SVM). The proposed Meta-SVM is motivated by the recent meta-analytic logistic regression method (Meta-logistic; [22]). To the best of our knowledge, no method has been introduced that extends SVMs to combine multiple studies in a meta-analytic fashion. To this end, we develop a novel implementation strategy in the spirit of Newton’s method to estimate the parameters of the Meta-SVM. The objective function of an SVM is commonly formed with the hinge loss and one of a range of penalty terms (e.g., L₁-lasso, group lasso, etc.). We, however, adopt the sparse group lasso technique (i.e., both the L₁-lasso and the group lasso simultaneously) to capture both common and study-specific genetic effects across all studies. On this ground, the proposed method serves the same purpose as rOP [41], AW [21] and Meta-logistic [22], whose feature selection allows the detection of study-specific effects. In genomic applications, data integration analysis has proved its practical utility and has become commonplace for identifying key regulators of cancer. Thus, many have paid attention to credible validation strategies that build on multiple studies [7, 35]. In addition, meta-analysis helps to adjust for tissue-specific effects that may distort the analysis of individual data sets [21]. The optimization strategy for estimation therefore focuses on how to handle these two penalty terms (L₁-lasso and group lasso) in the objective function. Traditional optimization approaches (e.g., linear and quadratic programming) mostly entail heavy computation; to overcome this, we propose an approximation method that reduces computational complexity and permits a concise implementation. The idea is to approximate the hinge loss, and where applicable the penalty terms, by a quadratic form, so that the classical coordinate descent algorithm can be applied to optimize the whole objective function.
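
To make the structure of such an objective concrete, the following minimal Python sketch evaluates a hinge loss summed over studies plus a sparse group lasso penalty in which each gene's group consists of its coefficients across studies. It is an illustration under stated assumptions, not the implementation in the metaSVM package: the function names, the per-study averaging of the hinge loss, and the (lam, alpha) parametrization of the penalty are hypothetical choices made here for clarity, and the quadratic approximation and coordinate descent steps are not shown.

```python
import math

def hinge_loss(X, y, beta, intercept):
    """Average hinge loss max(0, 1 - y * f(x)) for one study (labels in {-1, +1})."""
    total = 0.0
    for row, label in zip(X, y):
        score = intercept + sum(b * x for b, x in zip(beta, row))
        total += max(0.0, 1.0 - label * score)
    return total / len(y)

def sparse_group_lasso(betas, lam, alpha):
    """Penalty over genes; gene j's group is its coefficients across all studies.

    Assumed parametrization: lam * sum_j (alpha * ||b_j||_1 + (1 - alpha) * ||b_j||_2).
    """
    n_genes = len(betas[0])
    penalty = 0.0
    for j in range(n_genes):
        group = [beta[j] for beta in betas]           # one coefficient per study
        l1 = sum(abs(b) for b in group)               # lasso part: study-specific sparsity
        l2 = math.sqrt(sum(b * b for b in group))     # group part: gene-level sparsity
        penalty += alpha * l1 + (1.0 - alpha) * l2
    return lam * penalty

def meta_svm_objective(studies, betas, intercepts, lam, alpha):
    """Sum of per-study hinge losses plus the sparse group lasso penalty."""
    loss = sum(
        hinge_loss(X, y, beta, b0)
        for (X, y), beta, b0 in zip(studies, betas, intercepts)
    )
    return loss + sparse_group_lasso(betas, lam, alpha)

# Toy example: two studies, two genes; only the first gene carries signal.
studies = [
    ([[1.0, 0.0], [-1.0, 0.0]], [1, -1]),
    ([[2.0, 1.0], [-2.0, -1.0]], [1, -1]),
]
betas = [[1.0, 0.0], [0.5, 0.0]]
intercepts = [0.0, 0.0]
objective = meta_svm_objective(studies, betas, intercepts, lam=0.1, alpha=0.5)
```

In this toy setting both studies are separated with margin one, so the hinge loss vanishes and the objective reduces to the penalty on the first gene's group. The group term can shrink a gene's coefficients to zero in all studies at once, while the L₁ term can zero them out in individual studies.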

The paper is organized as follows. In the Methods section, we introduce the meta-analytic method that builds on the support vector machine (Meta-SVM) and describe its implementation strategy at length. The Simulation studies section presents experiments that benchmark feature detection performance under various scenarios. In the Applications to real genomic data section, we demonstrate the advantages of Meta-SVM in two real data applications using publicly available omics data, and concluding remarks are presented in the Concluding remark section. An R package “metaSVM” is publicly available online at the author’s web page (https://sites.google.com/site/sunghwanshome/).