Inter-platform concordance of gene expression data for the prediction of chemical mode of action

Concordance experiments

We conducted three types of investigations to study the performance of the proposed classifiers.

  1. Train classifiers and make predictions on individual platforms.

  2. Train classifiers on one platform to make predictions on the other platform.

  3. Identify important variables (genes) for accurate classification.

In the 1st analysis, we explored the predictability of MOAs using various classifiers developed on the given training data. To our knowledge, there are no established criteria for defining a prediction for an unknown class that was not represented in the training data. Thus, we created an adjusted test set by eliminating all test samples belonging to the two classes “ER” and “HMGCOA”; this adjusted test set was used in parts of the 1st and 3rd analyses. However, we also considered the originally given test set as part of the 1st analysis by adopting the following alternative classification approach. First, we designated both the “ER” and “HMGCOA” samples in the original test set as “OTHER”. For each classifier, we then determined the maximum class probability for a given test sample; if that probability was less than 0.5, we set the predicted class to “OTHER”, and otherwise kept the originally predicted class. For this purpose, class probabilities for the ensemble classifier were calculated as the predicted class proportions observed across the B bootstrap samples.
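This thresholding rule can be sketched as follows (a minimal illustration; the function name is ours, and for the ensemble classifier the rows of `class_probs` would be the bootstrap vote proportions):

```python
import numpy as np

def predict_with_other(class_probs, class_labels, threshold=0.5):
    """Map each test sample to its highest-probability class, or to "OTHER"
    when that maximum probability falls below the threshold."""
    class_probs = np.asarray(class_probs, dtype=float)
    max_idx = class_probs.argmax(axis=1)
    max_p = class_probs[np.arange(class_probs.shape[0]), max_idx]
    preds = np.array([class_labels[i] for i in max_idx], dtype=object)
    preds[max_p < threshold] = "OTHER"  # no class reaches probability 0.5
    return preds
```

For example, a sample whose largest class probability is 0.4 would be labelled “OTHER”, while a sample with a clear majority class keeps its original prediction.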

Our objective in the 2nd analysis was to examine the inter-platform concordance between the microarray and RNAseq platforms. We therefore trained classifiers on a selected platform using the full dataset, comprising both the given training and test sets, to make predictions on the other platform. Since the classifier needed to operate on both platforms for this analysis, each gene expression measurement was standardized separately within each platform prior to the analysis.
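A minimal sketch of this within-platform standardization (the function name and implementation are ours; the text does not specify the exact scaling, so we assume a per-gene z-score):

```python
import numpy as np

def standardize_within_platform(expr):
    """Z-score each gene (column) using that platform's own mean and
    standard deviation, so expression scales are comparable across
    microarray and RNAseq data. expr: samples x genes matrix."""
    expr = np.asarray(expr, dtype=float)
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant genes centered but unscaled
    return (expr - mu) / sd
```

Applied to each platform separately, every gene then has mean 0 and unit variance within its own platform.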

For analyses 1 and 2, we selected an ensemble classifier developed from a set of M=7 standard classifiers: SVM, RF, LDA, PLS+RF, PLS+LDA, PCA+RF, PCA+LDA, and Recursive Partitioning (RPART). These classifiers were selected primarily on the basis of prior information about their suitability for high-dimensional data classification. Based on the accuracies of the predicted classes, each classifier was ranked on each of K performance measures (for example, overall accuracy, class-specific accuracies, etc.). Since the choice of performance measures for a multi-class classification problem depends highly upon the aim of the study, for the 1st analysis we optimized the overall prediction accuracy and the class-specific accuracy of each group. Furthermore, we considered these performance measures to be equally important for classification (i.e., we used equal weights of w_i = 1 in Eq. (1)), whereas in the 2nd analysis, across platforms, we focused only on the overall accuracy without optimizing multiple group-specific performances. For these analyses, we chose B=300. We performed 10-fold cross-validation for each individual classifier to select the number of components for the PLS and PCA methods, separately for the two platforms. Assuming performance in the bootstrap samples consistent with that in the original training data, we used the same numbers of components when developing the ensemble classifier.
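The component-selection step can be sketched as follows, assuming scikit-learn and synthetic data in place of the actual expression sets; the candidate grid of component numbers is our own illustration, shown for the PCA+LDA pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for one platform's training data (samples x genes).
X, y = make_classification(n_samples=120, n_features=300, n_informative=15,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = {}
for k in range(2, 21, 2):  # candidate numbers of components (our choice)
    clf = make_pipeline(PCA(n_components=k), LinearDiscriminantAnalysis())
    cv_acc[k] = cross_val_score(clf, X, y, cv=cv).mean()

best_k = max(cv_acc, key=cv_acc.get)  # components with best 10-fold accuracy
```

The same scheme would be run per platform and per dimension-reduction method (PLS or PCA), with the selected numbers of components then reused across the bootstrap fits of the ensemble.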

The 3rd analysis, on identifying important variables, is subdivided into the following two parts.

  1. Detecting important genes with the adjusted test set.

  2. Detecting important genes with the full data using the cross-validation method.

We applied a classifier to perturbed training data, obtained by randomly permuting the expression values of a given gene, to quantify that gene’s impact on the predictability of MOAs in a test set. Accordingly, each gene was ranked by the magnitude of its accuracy reduction relative to the true accuracy (in the unpermuted data), such that rank 1 corresponds to the gene with the greatest negative impact on overall prediction accuracy. To reduce the computational burden, we did not use the ensemble classifier for this purpose; instead we used the component classifier PLS+LDA, whose overall accuracy was close to that of the ensemble classifier. We performed these analyses separately for the two platforms to determine the common set of genes present among the top 20 genes of both platforms.

For Analysis 3.1, we randomly permuted a gene’s expression values in the training set and then made predictions for the adjusted test set using the classifier trained on the permuted training data. The permutation procedure was repeated l times for each gene to calculate an average overall prediction accuracy (A). Finally, genes were ordered by A in ascending order. Here we chose l=30 to achieve a reasonably stable approximation while keeping the computational cost in check.
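The permutation ranking of Analysis 3.1 can be sketched as follows (a minimal illustration assuming a scikit-learn-style classifier; plain LDA stands in for the paper's PLS+LDA pipeline, and the function name is ours):

```python
import numpy as np
from sklearn.base import clone

def permutation_gene_ranking(clf, X_tr, y_tr, X_te, y_te, l=30, seed=0):
    """For each gene, permute its training expressions l times, retrain the
    classifier on the perturbed data, and record the average test accuracy.
    Genes are returned sorted ascending by that average, so the first index
    is the rank-1 gene (largest accuracy drop)."""
    rng = np.random.default_rng(seed)
    n_genes = X_tr.shape[1]
    avg_acc = np.empty(n_genes)
    for j in range(n_genes):
        accs = []
        for _ in range(l):
            Xp = X_tr.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break gene j's signal
            fitted = clone(clf).fit(Xp, y_tr)
            accs.append(np.mean(fitted.predict(X_te) == y_te))
        avg_acc[j] = np.mean(accs)
    return np.argsort(avg_acc), avg_acc
```

Run once per platform (with l=30 as in the text), the top-20 positions of the two returned orderings can then be intersected to find the common important genes.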