Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping


We obtained isoform-level gene expression estimates and molecular subtype information
for 342 and 155 GBM samples profiled by Affymetrix exon-arrays and RNA-seq, respectively.
The four molecular subgroups are neural (N), proneural (PN), mesenchymal (M) and classical
(CL). Gene expression profiles for 76 (18 are N; 22 are PN; 16 are M; and 20 are CL
subtype) samples were available from both RNA-seq and exon-array platforms. The common
samples were used to assess classification performance for platform transition. We
followed the data pre-processing procedure and obtained patients’ GBM subtype information
(class labels) from our recent study 14]; briefly, we downloaded the unprocessed Affymetrix exon-array dataset of 426 GBM
samples and 10 normal brain samples from TCGA data portal (https://tcga-data.nci.nih.gov/tcga/); obtained the isoform expression of 114,930 transcript variants (equivalent to
35,612 genes) using the Multi-Mapping Bayesian Gene eXpression program 33]. The estimated expression values were then normalized across the samples, using the
locally weighted scatterplot smoothing algorithm 34], a non-parametric regression method. To select 2,000 most variable transcripts, we
applied Pearson’s correlation coefficient with cutoff of 0.8 followed by the CV.
See 14] for more details.

Data type and transformation

We processed the gene expression data to estimate the FC values, and then three unsupervised
discretization techniques–Equal-W binning, Equal-F binning, and k-means clustering–on the continuous FC data.

FC is a measure of a quantitative change of gene expression, defined by FC=log2 (T/N), where T is estimated expression values of a tumor sample and N is median expression of normal brain samples.

To determine the number of bins for discretization, Dougherty et al 28] suggested a heuristic to set the maximum number of bins k = max (1, 2 log (l)), where l is the number of distinct values of the attribute. Boulle 29] proposed an algorithm to find an optimal bin number for Equal-F and Equal-W, and
demonstrated the optimal bin number performs similar to the bin number k = 10, considered as a default for most cases. While the former approach resulted the
maximum bin number k = 11, we extensively evaluated by exploring various bin numbers of k = 2i for i={1, 2, …, 10}.

Equal-W binning algorithm seeks for maximum and minimum values, and then divides the
range into the user-defined equal width intervals defined as Equal-W=(max(GE(i))-min(GE(i)))/number of bins, where GE is isoform-level transcript gene expression of sample
i. Then, continuous variables are assigned into the corresponding bin numbers.

Equal-F binning algorithm sorts all continuous variables in ascending order, and then
divides the range into the user-defined intervals so that every interval contains
the same number of sorted values defined as Equal-F=sort(GE(i))/number of bins.

K-means clustering algorithm calculates distance-based similarity to cluster the continuous
variables. With the user-defined number of clusters, the algorithm iteratively finds
centroids until no data point is reassigned to the updated centroids.

Feature selection methods

To capture most significant features, we first applied Pearson’s correlation to the
normalized expression data with a cutoff value of 0.8. Second, we used the CV to assess
the degree of variability for each transcript.

CV is defined as CV=? /µ, where ? and µ are the standard deviation and mean, respectively. Based on the CV scores, we selected
the top 2000 transcripts out of ~115,000. To refine the selected features further,
we employed two advanced feature selection algorithms based on SVM and RF that iteratively
evaluate each feature’s contribution to the classification performance. We adopted
the programs available in R packages ‘mSVM-RFE’ and ‘varSelRF.’

SVM-RFE is a feature search algorithm that measures feature’s importance to the data
by iteratively eliminating one feature at a time 13]. Adopted from the weight vector w of the binary classification problem, the ranking criteria is the coefficients of
; features with the highest weights are the most informative. Thus, the procedure
of SVM-RFE is composed of training the SVM classifier, computing the ranking criteria
for all features, and eliminating the feature with the lowest ranking criterion.
This process is repeated until a small subset of features is achieved.

RF_based_FS method uses both backward elimination strategy and the importance spectrum
to search a set of important variables 31]. Concisely, multiple random forests were iteratively constructed to search for a
set of variable in each forest that yields the smallest out-of-bag (OOB) error rate.
The main advantage of this method is that it returns a very small set of genes while
retaining high accuracy.

Classification methods

We considered the four classification methods–SVM, RF, NB, and PAM–to compare the
performance on platform transition using the 76 GBM samples.

SVM is primarily a two-class classifier that constructs a hyperplane to separate the
data with maximum margin 35,36]. For multiclass classification problems, two techniques are widely used: one is to
build one-versus-all classifiers, and choose the class that yields maximum margin
for test examples; the other is to build a set of one-versus-one classifiers. For
class C 2, C (C?1)/2 binary classifiers are trained and the appropriate class is determined by major
voting. In this study, we used the latter approach with a linear kernel method as
the size of features is larger than samples.

RF is an ensemble learning method that builds decision trees with binary splits 37]. Each tree is grown randomly in two steps. First, a subset of predictors is chosen
at random from all the predictors. Second, a bootstrap sample of the data is randomly
drawn with replacement from the original sample. For each RF tree, an unused observation
is utilized to calculate the classification accuracy.

NB is a simple probabilistic classification method grounded in Bayes’ theorem, for
calculating conditional probabilities, with an independence assumption 38]. For a given instance (example), the NB classifier calculates the probability belonging
to a certain class. The basic underlying assumption is that the features (x1,...,x) of an instance X are conditionally independent given the class C. For example, for a class C that maximizes the likelihood is P(X|C)=P(X1,…,X|C). The conditional independence enables the conditional probability as a product of
simpler probabilities defined by P(X|C)=? P(Xi|C).

PAM is a sample classification method that uses the nearest shrunken centroid approach
for transcript-variants gene expression data 26]. Briefly, the method computes a standardized centroid for each class. Then, it shrinks
each of the class centroids by removing genes toward the overall centroid for all
classes using a user-defined threshold. A new sample is assigned to the nearest centroid
for which classification is based on the unseen sample’s gene expression profile.


We estimated the overall classification accuracy based on the number of correct predictions
divided by the total number of prediction samples defined as ACC=(number of correct
predictions)/(total number of test samples). In addition, sensitivity (Sn) and specificity
(Sp) for each sub-group (one GBM sub-group vs the rest of the GBM groups together)
are calculated as and where tpi, tni, and fni, are true positive, true negative, false positive, and false negative for class Ci, respectively.