Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests

Evaluation measures

We used Breiman’s method, as described in [9], to calculate the average strength (s), the average correlation (ρ̄) and c/s2 as performance measures of a random forest. Out-of-bag estimates were used to evaluate
the strength and correlation. Given s and ρ̄, the out-of-bag estimate of the c/s2 measure can be computed as ρ̄/s2. The correlation measure indicates the independence of the trees in a forest, whereas
the average strength corresponds to the accuracy of the individual trees. Low correlation
and high average strength reduce the generalization error bound measured by
c/s2, which indicates a highly accurate RF model.
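As a concrete illustration, the out-of-bag estimates of s, ρ̄ and c/s2 for a binary-class ensemble can be sketched as follows. This is a minimal sketch, not the paper's implementation: it uses scikit-learn's BaggingClassifier over decision trees as a stand-in for the RF models, and the two-class identity sd(Θ) = sqrt(1 + (2·p1 − 1)²) follows Breiman's definitions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data as a stand-in for a case-control data set.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
forest = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           bootstrap=True, random_state=0).fit(X, y)

n = len(y)
votes = np.zeros((n, 2))   # out-of-bag votes per sample and class
p1 = []                    # per-tree OOB accuracy, P(h(X, theta) = Y)
for tree, idx in zip(forest.estimators_, forest.estimators_samples_):
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                       # samples not in this bootstrap
    pred = tree.predict(X[oob]).astype(int)
    votes[np.flatnonzero(oob), pred] += 1
    p1.append(np.mean(pred == y[oob]))

ok = votes.sum(axis=1) > 0                 # samples OOB at least once
q = votes[ok] / votes[ok].sum(axis=1, keepdims=True)
rows = np.arange(ok.sum())
# Margin: OOB vote fraction for the true class minus the other class.
margin = q[rows, y[ok]] - q[rows, 1 - y[ok]]

s = margin.mean()                          # strength
# Two-class case: sd(theta) = sqrt(1 + (2*p1 - 1)^2)  (Breiman 2001)
sd = np.sqrt(1.0 + (2.0 * np.asarray(p1) - 1.0) ** 2)
rho_bar = margin.var() / sd.mean() ** 2    # mean correlation
c_over_s2 = rho_bar / s ** 2               # error-bound measure
```

Lower c_over_s2 (low correlation, high strength) corresponds to a tighter generalization error bound.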

Let Nt denote the number of samples in a test data set. Two measures are used to evaluate the prediction performance of the RF
models on the test set. One is the area under the curve (AUC). The other is the test accuracy, computed as:

    Acc = (1/Nt) · Σ_{i=1}^{Nt} I( y_i = argmax_j N_j(x_i) ),    (3)

where I(·) is the indicator function, y_i indicates the true class of the test sample x_i, and N_j(x_i) is the number of votes for x_i on class j.
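The test accuracy reduces to a vote-counting argmax over classes; a toy illustration (the vote counts are made up):

```python
import numpy as np

# votes[i, j] = number of trees voting class j for test sample x_i (toy values)
votes = np.array([[8, 2],
                  [3, 7],
                  [5, 5],
                  [1, 9]])
y_true = np.array([0, 1, 1, 1])

# Equation (3): fraction of samples whose majority-vote class matches y_i.
acc = np.mean(np.argmax(votes, axis=1) == y_true)
# -> 0.75 (the tied third sample resolves to class 0 and is counted wrong)
```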

Results on SNP data sets

We conducted experiments on two genome-wide SNP data sets whose characteristics are
summarized in Table 1. The “Abbr” column indicates the abbreviation of each genome-wide SNP data set used in
the experiments.

Table 1. Description of two GWA data sets.

The real Alzheimer disease data set has been analyzed and reported in Webster et al. [25]. It contains genotypes of a total of 380,157 SNPs in 188 neurologically normal individuals
(controls) and 176 Alzheimer disease patients (cases). The genotype data for Parkinson
disease patients and controls were published in [26]. This genome-wide SNP data set consists of 271 controls and 270 patients with Parkinson disease,
cerebrovascular disease, epilepsy, and amyotrophic lateral sclerosis. The raw genotype
data can be found in the NCBI [1] dbGaP repository under study accession phs000089.v3.p2.

5-fold cross-validation was used to evaluate the prediction performance of the
models on the GWA data sets. For each fold, we built the models with 500 trees, and the
SNP partition was re-calculated on each training fold. We also compared the
prediction performance of the ts-RF model with a linear-kernel SVM, taken from LibSVM
[2]; the values of the regularization parameter C were 2^-2 and 2^-5, respectively. These optimal values of C provided the highest validated accuracy on the training data set. The
minimum node size nmin was 1. The parameters R, mtry and the threshold for pre-computation of the SNP partition were set to 30, 0.1M and 0.05, respectively. We used R to call the corresponding C/C++ functions of the
ts-RF model, and all experiments were conducted on six 64-bit Linux machines, each
equipped with an Intel Xeon CPU E5620 at 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory. The ts-RF and wsRF
models were implemented as multi-thread processes, while the other models were run as
single-thread processes.
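The evaluation protocol above can be sketched as follows. This is a hedged sketch using scikit-learn stand-ins (RandomForestClassifier for the RF baseline and LinearSVC for the linear-kernel SVM; the ts-RF code itself is not assumed here), with C tuned on each training fold only, as in the protocol described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in for a case-control SNP matrix (samples x SNPs).
X, y = make_classification(n_samples=200, n_features=50, random_state=1)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
accs = []
for tr, te in outer.split(X, y):
    # RF baseline: 500 trees, minimum node size 1, as in the setup above.
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=1,
                                random_state=1).fit(X[tr], y[tr])
    # Linear SVM: C chosen by validated accuracy on the training fold only.
    svm = GridSearchCV(LinearSVC(dual=False),
                       {"C": [2 ** -2, 2 ** -5]}, cv=3).fit(X[tr], y[tr])
    accs.append((rf.score(X[te], y[te]), svm.score(X[te], y[te])))

mean_rf, mean_svm = np.mean(accs, axis=0)  # averaged over the 5 folds
```

In the actual experiments the SNP partition used by ts-RF is also re-computed inside each training fold, so no test-fold information leaks into feature screening.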

Table 2 shows the average test accuracies and AUC of the models on the two GWA data sets
using 5-fold cross-validation. We compare our ts-RF model with Breiman’s RF method
and two recently proposed random forests models: the guided regularized random
forests (GRRF) model [27] and the weighting subspace random forests (wsRF) model [28]. In the GRRF model, weights are calculated using RF importance scores produced
from the out-of-bag data, and these weights are used to guide the feature selection
process. The authors found that the least regularized subset selected by their random forests
with minimal regularization ensures better accuracy than the complete feature set.
Xu et al. proposed the wsRF model, which weights the input features
and then selects features to ensure that each subspace always contains informative
features. Their efficient RF algorithm can be used to classify multi-class data.

Table 2. Comparison of different random forests models on the SNP pair data sets with different
mtry values.

The latest RF [29] and GRRF [30] R packages were used in the R environment to conduct these experiments. For the GRRF
model, we used a value of 0.1 for the coefficient γ, because GRRF(0.1) has shown competitive prediction performance in [27]. We can see that ts-RF and wsRF always produced good results across different mtry values. The wsRF model achieved higher prediction accuracy at some mtry values. The ts-RF model using mtry = Mp outperformed the RF, GRRF, wsRF models and SVM on both GWA data sets, where Mp = ||Xs|| + ||Xw|| denotes the number of informative SNPs. The RF model requires a larger number of
SNPs to achieve better prediction accuracy (mtry = 0.5M). With this subspace size, the computational time for building a random forest is still too
high, especially for GWA data sets. It can be seen that the ts-RF model can select
good SNPs for the subspace to achieve the best prediction performance. These empirical
results indicate that, when classifying GWA data sets with ts-RF built from small
yet informative subspaces, the achieved results can be satisfactory.

[1] http://www.ncbi.nlm.nih.gov/

[2] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools

Table 3 shows the prediction accuracy and Table 4 shows the c/s2 error bound of the random forest models with different numbers of trees, while mtry = ⌊log2(M) + 1⌋ was fixed, on the GWA data sets. We conducted these experiments
to compare the new model with the other random forests models and observed an obvious improvement
in classification accuracy on all GWA data sets. For the comparison of the c/s2 error bound, the GRRF model was not considered in this experiment because the RF
model of Breiman’s method [29] is used inside the GRRF model as the classifier. The efficient wsRF model [28] and Breiman’s method were used for comparison in this experiment. We used the
RF, wsRF and ts-RF models to generate random forests of different sizes, from 20 trees
to 200 trees, and computed the average accuracy of the results from the 5-fold cross-validation.
We can clearly see that the ts-RF model outperformed the other models in classification
accuracy and produced the lowest c/s2 error in most cases on all GWA data sets.

Table 3. The prediction test accuracy of the models on the SNP pair data sets against the number
of trees K.

Table 4. The (c/s2) error bound results of the models on the SNP pair data sets against the number
of trees K.

The proposed ts-RF model was applied to the Parkinson genome-wide data and assigned
an importance score to each SNP. The resulting list of SNPs was investigated for
potential relevance to Parkinson disease. Table 5 shows the top 25 SNPs that are located within gene regions studied
in previous work. For each SNP, we report the rank, SNP ID, gene
symbol, gene ID, and the p-value obtained using the Wilcoxon test. The boldface rows in the
table are the interesting genes associated with Parkinson disease. The results of
this real data analysis validate the findings of GWA studies for genes such as PTPRD, EPHA4 and CAST. The results also suggest other potential SNPs and genes that may be associated with
Parkinson disease. Notably, some of these SNPs were not found to be strongly
associated with Parkinson disease by traditional statistical tests because they
have relatively high p-values. This provides evidence of the advantage of using the
proposed ts-RF model to detect potential SNPs associated with the disease. However,
interpreting the results and assessing their biological plausibility is challenging. Biologists
can perform further investigation to validate their relationship with Parkinson
disease.
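The ranking-plus-test procedure behind Table 5 can be sketched on synthetic genotype data. This is a hypothetical illustration: the sample sizes, SNP count and the two causal SNPs are made up, forest importance is scikit-learn's impurity-based score rather than the paper's ts-RF score, and the Wilcoxon rank-sum test is taken from SciPy.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, m = 200, 100                                  # samples x SNPs
X = rng.integers(0, 3, size=(n, m)).astype(float)  # 0/1/2 allele counts
# Hypothetical phenotype driven by the first two SNPs plus noise.
y = (X[:, 0] + X[:, 1] + rng.normal(0.0, 1.0, n) > 2.0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rank = np.argsort(rf.feature_importances_)[::-1]   # most important first

# Attach a Wilcoxon rank-sum p-value (cases vs. controls) to each top SNP.
top = [(int(j), ranksums(X[y == 1, j], X[y == 0, j]).pvalue)
       for j in rank[:5]]
```

A SNP can rank highly in the forest yet carry an unremarkable p-value, which mirrors the observation above that importance-based ranking can surface SNPs missed by univariate tests.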

Table 5. Top 25 SNPs identified by ts-RF in Parkinson case-control data set.

In summary, ts-RF is a promising method for applying the RF approach to high-dimensional
data such as GWA data. The application of ts-RF to GWA data may help to identify potentially
interesting SNPs that are difficult to find with traditional statistical approaches.

Results on gene data sets

To validate our conjecture that the proposed ts-RF model is effective for GWA data,
we conducted additional experiments on gene data sets. In this experiment, we
compared the performance of the models on the 10 gene data sets used in [31,27]. The characteristics of these data sets are given in Table 6. Using this type of data set makes sense, since the number of genes in these data
sets is much larger than the number of patients. For the RF method to obtain high
accuracy, it is critical to select good genes that capture the characteristics
of the data while avoiding overfitting.

Table 6. Description of 10 gene data sets.

For the comparison of the models on the gene data sets, we used the same settings as in
[27]. For the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [27] when applied to the 10 gene data sets. From each gene data set, two-thirds of
the data were randomly selected for training. The other one-third of the data set
was used to validate the models. One hundred models were generated with different seeds
from each training data set, and each model contained 1000 trees. The mtry parameter was set as in [27] and nmin was set to 1. The prediction performances of the 100 random
forest classification models were evaluated using Equation (3).
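This repeated-holdout protocol can be sketched as follows, scaled down to 10 repetitions of 100 trees on synthetic data so it runs quickly (the paper's settings were 100 repetitions of 1000 trees, on the real gene data sets).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic gene-expression-style data: far more genes than patients.
X, y = make_classification(n_samples=90, n_features=2000, n_informative=20,
                           random_state=0)

scores = []
for seed in range(10):
    # Two-thirds training / one-third validation, re-drawn per repetition.
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1 / 3,
                                          random_state=seed, stratify=y)
    rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=1,
                                random_state=seed).fit(Xtr, ytr)
    scores.append(rf.score(Xte, yte))   # test accuracy, as in Equation (3)

mean_acc, std_acc = np.mean(scores), np.std(scores)  # mean ± std-dev
```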

Table 7 shows the averages over the 100 repetitions of the c/s2 error bound when varying the number of genes per leaf nmin. It can be seen that the RF and wsRF models produced a lower error bound on some data
sets, for example COL, BR2, NCI and PRO. The ts-RF model produced the lowest c/s2 error bound on the remaining gene data sets in most cases. This implies that when
optimal parameters such as nmin = 1 were used in growing trees, the number of genes in the subspace was not small,
and out-of-bag data were used in prediction, the results comparatively favored the
ts-RF model. When the number of genes per leaf increased, and thus the depth of the trees
decreased, the ts-RF model obtained better results than the other models in
most cases, as shown in Table 7. These results demonstrate that the two-stage quality-based feature sampling
method for gene subspace selection can reduce the upper bound of the generalization
error of random forests models.

Table 7. The (c/s2) error bound results of random forest models against the number of genes per leaf
nmin on the ten gene data sets.

Figures 1, 2, 3 and 4 show the effect of the two-stage quality-based feature sampling method on the strength
measure of random forests. The 10 gene data sets were analyzed, and the results were compared
to those of the random forests built by Breiman’s method and the wsRF model. In each random
forest, the trees were grown from bootstrapped training data, and the number of genes per leaf
nmin varied from 1 to 15. Out-of-bag estimates were used to evaluate the strength measure.
From these figures, we can observe that the wsRF model obtained higher strength on
the two data sets NCI and BRA when the number of genes per leaf was 1. The strength
measure of the ts-RF model ranked second on these two data sets and first on the remaining gene data sets, as shown in Figure 1. Figures 2, 3 and 4 demonstrate the effect of the depth of the tree: the ts-RF model provided the best
results when varying the number of genes per leaf. This phenomenon implies that at
lower levels of the tree, the gain is reduced because of the effect of splits on different
genes at higher levels of the tree. The other random forests models reduce the strength
measure dramatically, while the ts-RF model remains stable and produces the best
results. The effect of the new sampling method is clearly demonstrated by this result.

Figure 1. Box plots of Strength measures on the 10 gene data sets with respect to nmin = 1.

Figure 2. Same as Figure 1, but for nmin = 5.

Figure 3. Same as Figure 1, but for nmin = 10.

Figure 4. Same as Figure 1, but for nmin = 15.

Table 8 shows the average test accuracy results (mean ± std-dev%) of the 100 random forest
models computed according to Equation (3) on the gene data sets. The average number
of genes selected by the ts-RF model, from 100 repetitions for each data set, are
shown on the right of Table 8, divided into a strong group Xs and a weak group Xw. These genes were used by the two-stage quality-based feature sampling method in
growing trees in ts-RF.

Table 8. Test accuracy results (mean ± std-dev%) of random forest models against the number
of genes per leaf nmin on the ten gene data sets.

The results from the application of GRRF on the ten gene data sets were presented
in [27]. From the prediction accuracy results in Table 8, the GRRF model provided a slightly better result on the SRB data set for nmin = 15 and on PRO for nmin = 1. The wsRF model presented the best result on the BRA and NCI data sets
for nmin = 15. In the remaining cases on all gene data sets, the ts-RF model showed the best
results. In the cases where ts-RF did not obtain the best results, the differences
from the best results were minor. This is because the two-stage quality-based feature
sampling used in generating trees in ts-RF provides a gene subspace with enough
highly informative genes at every level of the decision tree. The effect of the two-stage
quality-based feature sampling is clearly demonstrated by these results.