Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia

AML dataset

For 344 adults, clinical, cytogenetic and molecular characteristics were analysed
using bone marrow or peripheral blood, as described previously [3,9]. For each patient sample, genome-wide mRNA expression data (GEP) were measured using
Affymetrix HG-U133 Plus 2.0 arrays (Affymetrix, Santa Clara, CA, USA). Raw data were normalized
with RMA [21-23], and probes on the array were remapped to RefSeq transcripts using a custom chip definition
file (CDF) [24] (N_GEP = 21,678 RefSeq transcripts). For the same set of samples, genome-wide DNA-methylation data (DMP)
were measured with the HELP assay [25] and pre-processed as described previously [9] (N_DMP = 22,725 features). Both datasets are annotated using the UCSC hg19 genome build and are
available from the NCBI Gene Expression Omnibus under accession numbers GSE14468 (HOVON-SAKK
cohort) and GSE18700, respectively.

Cytogenetic and molecular abnormalities in AML

Groups of AML patients that are characterized by a common cytogenetic or molecular abnormality
are denoted as subtypes. We studied fifteen of the most common subtypes, which can
roughly be categorized into three risk groups (good, intermediate and poor). AML subtypes
in the good risk group are inv(16), t(15;17), and t(8;21). The intermediate risk group
contains patients with molecular abnormalities, i.e. CEBPA double-mutant, CEBPA silenced, NPM1 mutant, FLT3-ITD, FLT3-TKD, FLT3-ITD/NPM1 wild-type, FLT3 wild-type/NPM1 mutant, FLT3-ITD/NPM1 mutant, and NRAS mutant cases. The poor risk group contains patients with a complex karyotype (patients with
more than 3 cytogenetic abnormalities), and 3q, 7q, and 11q23 cases. We used the
15 subtypes as classification labels. Note that samples can carry multiple abnormalities.

Classification strategies

The AML subtypes are classified using three different strategies: i) no integration
(using the GEP or DMP dataset individually); ii) early integration; and iii) late integration
(Figure 3). For each subtype, we train a two-class classifier to distinguish between samples
that belong to the subtype and samples that belong to the other 14 subtypes (i.e.,
a one-versus-all classification scheme). We employed the logistic regression classifier
with lasso regularization [26,27], which optimizes the regression output and selects features by enforcing sparsity.
To ensure unbiased measurement of classifier performance, we followed the
double-loop cross-validation (DLCV) protocol [28]. First, we split the input set into five equal subsets (the outer loop). In each
iteration, one subset is used as the validation set and the other four subsets as the input
set for classifier training. Training of the classifier is based on a 5-fold cross-validation
scheme (the inner loop) to optimize the regression model parameters (see Figure 3 for more details). To ensure that the lasso penalizes each feature similarly,
we standardized each feature to unit second central moment (i.e., unit variance) before applying
the penalty.
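As an illustration, a minimal sketch of this per-subtype DLCV scheme, written with scikit-learn, might look as follows. The feature matrix X (samples by features), the binary one-versus-all label vector y, and the regularization grid are hypothetical placeholders, not values taken from the study.

```python
# A minimal sketch of the double-loop cross-validation (DLCV) scheme with a
# lasso-regularized logistic regression. All variable names and the grid of
# regularization strengths are illustrative assumptions.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV

def dlcv_posteriors(X, y, seed=0):
    """Return out-of-fold posterior probabilities from the DLCV outer loop."""
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    posteriors = np.zeros(len(y))
    for train_idx, val_idx in outer.split(X, y):
        # Standardize each feature to unit variance so the lasso penalty
        # treats all features equally, then fit an L1-penalized model.
        model = Pipeline([
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
        ])
        # Inner 5-fold loop: tune the regularization strength on the four
        # training subsets only (the grid below is a hypothetical choice).
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        grid = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                            cv=inner, scoring="roc_auc")
        grid.fit(X[train_idx], y[train_idx])
        # Score the held-out validation subset of the outer loop.
        posteriors[val_idx] = grid.predict_proba(X[val_idx])[:, 1]
    return posteriors
```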

Figure 3. Schematic overview of the classification approach along with the integration strategies. The left part shows the no and early integration strategies, in which the logistic
regression classifier is trained on GEP or DMP alone and on GEP+DMP, respectively. In the DLCV scheme
we split the input data into five subsets. The classifier was trained by means of
a 5-fold cross-validation approach, using four subsets for training and testing and
one for validation. The right part shows the late integration procedure, where a
nearest mean classifier (NMC; the second layer) was trained on the new two-dimensional
data representing the first-layer outcomes for the GEP and DMP sets. The second layer
was evaluated on the first-layer outcomes of the validation subset. The reported performance
is the average of the classification performance on the 5 validation subsets.

i) No integration

For the GEP and the DMP input datasets we followed the DLCV scheme, for each of the
datasets separately. The optimal set of features, i.e. those discriminating between
patients from one subtype and the remaining patients, was determined in the DLCV
inner loop. Subsequently, we used this set of features to classify samples in the
independent validation set (DLCV outer loop) and then calculated the classification
performance and accuracy for each data type.
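In terms of the sketch above, no integration amounts to running the same routine once per data type (X_gep, X_dmp and y are again hypothetical placeholders):

```python
# No integration: apply the DLCV routine to each data type separately.
p_gep = dlcv_posteriors(X_gep, y)  # posteriors from expression features
p_dmp = dlcv_posteriors(X_dmp, y)  # posteriors from methylation features
```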

ii) Early integration

In this strategy, we combined all the features by concatenating the GEP and DMP features,
yielding N_TOTAL = 44,403 features. We then followed the DLCV scheme, with the exception that the
regression model was now provided with all features.
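Continuing the hypothetical sketch, early integration only changes the input matrix: the two feature sets are concatenated column-wise before the same DLCV routine is run (assuming both matrices hold the samples as rows, in identical order).

```python
import numpy as np

# Early integration: concatenate GEP and DMP features into one matrix
# (N_TOTAL = N_GEP + N_DMP columns), then classify as before.
X_total = np.hstack([X_gep, X_dmp])
p_early = dlcv_posteriors(X_total, y)
```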

iii) Late integration

For the late integration strategy, we established a two-layer classifier (Figure 3). Initially, we followed the DLCV scheme for each data type separately. Each inner
loop generates two optimized sets of parameters for the logistic regression model,
one set for each data type. In the next step we train an additional classifier, a nearest
mean classifier (NMC), that uses the posterior probabilities of the GEP and DMP logistic
regressors as its feature space. Hence, the integration of the two data types is achieved
by exploiting the confidence associated with each individual data type. Finally, the output of
the NMC is evaluated on the validation set.
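A minimal sketch of this second layer, assuming first-layer posteriors are already available for the training and validation samples, might look as follows; NearestCentroid is scikit-learn's nearest mean classifier, and all variable names are illustrative.

```python
# Late integration: a nearest mean classifier on the two-dimensional space
# spanned by the first-layer posteriors (one dimension per data type).
import numpy as np
from sklearn.neighbors import NearestCentroid

# Stack the GEP and DMP posteriors into two-dimensional feature vectors.
Z_train = np.column_stack([p_gep_train, p_dmp_train])
Z_val = np.column_stack([p_gep_val, p_dmp_val])

nmc = NearestCentroid()     # assigns each sample to the nearest class mean
nmc.fit(Z_train, y_train)   # second layer trained on first-layer outputs
y_pred = nmc.predict(Z_val) # evaluated on the validation-subset posteriors
```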

Classification accuracy and performance

F-score

The F-score is used to test the prediction accuracy. It is the harmonic mean of the positive
predictive value (precision) and the true positive rate (recall or sensitivity), and
varies between 0 (worst accuracy) and 1 (best accuracy). To assign a sample to a class,
we used the default threshold of 0.5 on the posterior probability obtained from the
logistic regressor. As a result, samples for which the posterior probability is between
0.5 and 1.0 are assigned to the subtype of interest. The F-score is of particular
interest in diagnostic settings, where it is important to know how many patients (samples)
are correctly or wrongly classified.
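For instance, with scikit-learn the thresholding and F-score computation could be sketched as follows (y_true and posteriors are hypothetical arrays holding the true one-versus-all labels and the regressor outputs):

```python
from sklearn.metrics import f1_score

# Assign a sample to the subtype when its posterior is at least 0.5,
# then compute the F-score (harmonic mean of precision and recall).
y_pred = (posteriors >= 0.5).astype(int)
f_score = f1_score(y_true, y_pred)  # ranges from 0 (worst) to 1 (best)
```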

Area under the curve (AUC)

The area under the curve (AUC) is computed on the receiver operating characteristic
(ROC) curve, which integrates performance over all possible thresholds on the posterior
probability obtained from the regressor. This effectively considers all possible assignments
of samples to the subtype by the regressor.
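In the same hypothetical setting, the AUC is computed directly from the posteriors, without fixing a threshold:

```python
from sklearn.metrics import roc_auc_score

# AUC summarizes the ROC curve over all possible posterior thresholds.
auc = roc_auc_score(y_true, posteriors)
```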

Global test

To examine whether the global pattern of an input set significantly associates with
the subtype, we apply the global test method [29]. The use of the global test in the evaluation is important because the classification
accuracy (F-score) and the classification performance (AUC) describe only the classifier
output scores on the test set. The global test, on the other hand, yields a P-value based on the null hypothesis that there is no information in the given input
features related to the sample label (e.g. subtype). Specifically, the global test method
applies a regression model to test the null hypothesis that the variance of the regression
coefficients of all input features is zero, and subsequently calculates a test statistic.
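The published method is available as the R/Bioconductor package globaltest; purely for intuition, a simplified permutation-based analogue of its score statistic might be sketched as below. This is not the analytic procedure of the real package, which derives the null distribution without permutations.

```python
# A simplified, permutation-based analogue of the global test idea.
# The score statistic S = (y - ybar)' X X' (y - ybar) grows when many
# features covary with the label; its significance is assessed here by
# permuting the labels. All names are illustrative assumptions.
import numpy as np

def global_test_pvalue(X, y, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    yc = y - y.mean()                    # centered labels
    s_obs = np.sum((X.T @ yc) ** 2)      # observed score statistic
    s_perm = np.array([
        np.sum((X.T @ rng.permutation(yc)) ** 2)
        for _ in range(n_perm)
    ])
    # One-sided permutation p-value with the usual +1 correction.
    return (1 + np.sum(s_perm >= s_obs)) / (1 + n_perm)
```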