A potential signature of eight long non-coding RNAs predicts survival in patients with non-small cell lung cancer


Derivation of an eight-lncRNA prognostic signature from the training dataset

The NSCLC patient cohort from GSE37745 (n = 196), including the relatively large patient
sample size and relatively overall clinical information, was selected as training
dataset to explore the association between lncRNA expression and OS of NSCLC patients.
We first conducted a univariate Cox proportional hazards regression analysis of the
lncRNA expression data with OS as the dependent variable, and identified a set of
eight lncRNAs as prognostic lncRNAs which were significantly correlated with patients’
OS (p value of 0.005). Table 2 shows a list of these eight prognostic lncRNAs along with important variable information.
Of the eight lncRNAs, the higher expression level of lncRNA RP11–21L23.2, GPR158–AS1, RP11–701P16.5 and RP11–379F4.4 was associated with shorter OS (coefficient 0), and the higher expression levels
of the remaining four lncRNAs (CTD–2358C21.4, RP11–94L15.2, KCNK15–AS1 and AC104134.2) were associated with longer OS (coefficient  0). Then we further examined whether
these eight prognostic lncRNAs are differentially expressed between cancer and normal
lung tissue. The lncRNA differential expression analysis was performed for GSE18842
dataset (including 46 tumor and 45 normal lung tissue specimens) (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18842) 38] obtained from GEO database. We found that five of eight prognostic lncRNAs showed
significant expression differences between tumor and normal lung tissue (Mann–Whitney
U test p  0.05) (Additional file 1: Figure S1), demonstrating that these selected prognostic lncRNAs are associated
with lung cancer.

Table 2. Eight lncRNAs significantly associated with the overall survival of NSCLC patients
in the training set (n = 196)

An eight-lncRNA signature predicts survival of NSCLC patients in the training dataset

To investigate whether the eight-lncRNA signature could provide an accurate prediction
of OS in NSCLC patients, the expression data of these eight lncRNAs and other clinical
features were fitted into a multivariable Cox regression model as covariates of the
training dataset. A risk score was generated for each patient in the training dataset
according to the risk-score model (see “Methods”) as follows: Risk score = (0.306 × expression
value of RP11–21L23.2) + (?0.314 × expression value of CTD–2358C21.4) + (?0.252 × expression value of RP11–94L15.2) + (0.288 × expression value of GPR158–AS1) + (?0.271 × expression value of KCNK15–AS1) + (?0.299 × expression value of AC104134.2) + (0.284 × expression value of RP11–701P16.5) + (0.321 × expression value of RP11–379F4.4). To evaluate how well the risk score predicts the 5-year survival, the various cutoff
values were evaluated using time-dependent ROC curve (Figure 1a) which is commonly used for revealing the predictive accuracy of a model 39], 40]. In the training dataset, AUC for the eight-lncRNA signature prognostic model was
0.78 at an OS of 5 years, demonstrating the better performance for survival prediction
of the lncRNA expression-based risk score in the training dataset. All patients in
the training dataset were then ranked according to their risk score, and divided into
either the high- or low-risk group using the median risk score as the cutoff point.
According to this cutoff value, patients were divided into either a high-risk group
(n = 98) or a low-risk group (n = 98). Patients in the high-risk group had a significantly
shorter OS than those in the low-risk group (median OS 1.67 vs. 6.06 years, log-rank
test p = 4.33E?09). Kaplan–Meier curves for the high- and low-risk groups in the training
dataset (n = 196) are shown in Figure 1b. In detail, OS rates of patients in the high-risk group were 30.6% at 4 years, 19.1%
at 6 years, 17.8% at 8 years and 11.9% at 10 years, versus 63.3, 53.8, 46.3 and 38%
in the low-risk group, respectively.

Figure 1. The eight-lncRNA signature-focused risk score in prognosis of overall survival in
the GSE37745 patient set. a Receiver operating characteristic (ROC) analysis of the risk scores for overall survival
prediction in the training dataset. The area under the curve (AUC) was calculated
for ROC curves, and sensitivity and specificity were calculated to assess score performance. b The Kaplan–Meier curve for overall survival of two patient groups with high- and
low-risk scores in the GSE37745 training set (n = 196). The differences between the
two curves were evaluated by the two-sided log-rank test. c The eight lncRNA-based risk score distribution, patients’ survival status and heatmap
of the eight lncRNA expression profiles. The black dotted line represents the cutoff value of the risk score derived from the training set which
separated patients into high- and low-risk groups.

A significant association between the eight-lncRNA signature risk score and OS was
observed in the univariable Cox regression model (Table 3). The hazard ratios of the eight-lncRNA signature risk score of the high-risk group
versus that of the low-risk group for OS was 2.641 [p  0.001; 95% confidence interval
(CI) 1.887–3.697; Table 3].

Table 3. Univariable and multivariable Cox regression analysis of the lncRNA signature and
overall survival of NSCLC patients in the training and two independent cohorts

The distribution of risk score, survival status and prognostic lncRNA expression in
196 patients of the training dataset are shown in Figure 1c. Of these eight prognostic lncRNAs, the high expression level of lncRNA RP11–21L23.2, GPR158–AS1, RP11–701P16.5 and RP11–379F4.4 was associated with high risk, while the remaining four lncRNAs (CTD–2358C21.4, RP11–94L15.2, KCNK15–AS1 and AC104134.2) were shown to be protective. NSCLC patients with high prognostic scores tended to
express high-risk lncRNAs, whereas those with low prognostic scores tended to express
protective lncRNAs. Moreover, more deaths were noted for NSCLC patients with high-risk
scores than for those with low-risk scores.

Validation of the eight-lncRNA signature for survival prediction in the testing GSE31210
dataset

To validate the prognostic power of the eight-lncRNA signature for survival prediction,
the constructed expression-defined lncRNA prognostic model was also evaluated in the
testing GSE31210 dataset. The same prognostic risk score model obtained from the training
dataset was used to calculate the eight-lncRNA signature-based risk scores for 226
patients in the entire GSE31210 dataset. The cutoff value of the risk score derived
from the training dataset without re-estimating parameters was used for the testing
dataset to classify the patients into either a high-risk group (n = 111) or a low-risk
group (n = 115). Patients with high-risk scores exhibited poorer OS than those with
low-risk scores (median OS 4.45 vs. 5.08 years, log-rank test p = 1.65E?03). Kaplan–Meier
curves for the high- and low-risk groups in the testing dataset are shown in Figure 2a. The OS rate of patients in the high-risk group was 91.7% at 2 years and 78.7% at
4 years, versus 97.4 and 91.5% in the low-risk group, respectively. A significant
association between the eight-lncRNA signature risk score and OS in the univariable
Cox regression model was observed. The hazard ratios of the eight-lncRNA signature
risk scores of the high-risk group versus the low-risk group for OS was 3.067 (p = 0.003;
95% CI 1.471–6.395; Table 3).

Figure 2. The eight-lncRNA signature-focused risk score in prognosis of overall survival in
additional validation datasets. a Kaplan–Meier survival curves were plotted for GSE31210 (n = 226). b The eight lncRNA-based risk score distribution, patients’ survival status and heatmap
of the eight lncRNA expression profiles were analyzed in the GSE31210. c Kaplan–Meier survival curves were plotted for GSE50081 (n = 181). d The eight lncRNA-based risk score distribution, patients’ survival status and heatmap
of the eight lncRNA expression profiles were analyzed in the GSE50081.

The distribution of patient lncRNA risk score, survival status and prognostic lncRNA
expression in 226 patients of the GSE31210 dataset are shown in Figure 2b, revealing the similar results observed in the GSE37745 training dataset.

Further validation of the eight-lncRNA signature in another independent dataset

To investigate the reproducibility of the eight-lncRNA signature in predicting OS
of NSCLC patients, the prognostic power of the eight-lncRNA signature for prediction
of survival was further validated in another independent NSCLC cohort of 181 patients
whose expression and survival data were obtained from GEO GSE50081. The clinical feature
of this independent NSCLC cohort is shown in Table 1. Patients in this independent NSCLC cohort were classified into either a high-risk
group (n = 90) or a low-risk group (n = 91) according to the cutoff value of risk
scores obtained from the training dataset. The median OS of the high-risk group for
the GSE50081 dataset is 4.29 years, whereas that of the low-risk group is 4.99 years
(log-rank test p = 1.26E?02). Kaplan–Meier curves for the high- and low-risk groups
within the independent GSE50081 cohort is shown in Figure 2c. Further univariable Cox regression analysis revealed that the high-risk scores
of eight-lncRNA signature was significantly associated with lower OS of patients in
GSE50081 dataset (p = 1.40E?02; HR = 1.795, 95% CI 1.127–2.859; Table 3). Figure 2d shows the distribution of patient risk scores, the survival status and prognostic
lncRNA expression in the independent GSE50081 NSCLC cohort, ranked according to the
prognostic risk score values for the eight-lncRNA signature, which were similar to
those observed in the training and GSE31210 datasets.

Survival prediction by the eight-lncRNA signature is independent of clinical features

To assess whether the prognostic power of the eight-lncRNA signature for prediction
of survival was independent of other clinical features, multivariable Cox regression
analysis was performed using the lncRNA signature-based risk score and other clinical
features, including age, gender, smoking status, tumor stage and subtype, which were
used as covariates. The results of multivariable Cox regression analysis from three
NSCLC patients datasets showed that the prognostic power of the eight-lncRNA signature
risk score (high-risk group vs. low-risk group, HR = 2.761, 95% CI 1.934–3.942, p  0.001
for GSE37745; HR = 2.643, 95% CI 1.263–5.528, p = 0.01 for GSE31210; HR = 1.752, 95%
CI 1.014–3.026, p = 0.044 for GSE50081) for prediction of survival was indeed independent
of these clinical features (Table 3). We also found that the two clinical factors, age and stage, also affected overall
survival of patients. So, a data stratification analysis was performed according to
age and stage. The three GEO datasets (GSE37745, GSE31210 and GSE50081), which included
a total of 603 patients, were stratified by age into either a younger stratum (age ?65)
or an elder stratum (age 65). The results of stratified analysis showed effective
prognostic power in both the younger and elder patient groups. The eight-lncRNA signature
could classify patients within each age stratum into either high- or low-risk groups
with significantly different OS (log-rank test p = 4.46E?05 for the younger patient
group and p = 6.61E?06 for the elder patient group) (Figure 3a, b), which suggested that the prognostic power of the eight-lncRNA signature was
also age-independent. Then the patients of early (stage I and II) and late (III and
IV) stage for GSE37745 dataset were grouped into two separate groups. The stratified
analysis was further performed in early and late stage patients to evaluate whether
the eight-lncRNA signature could predict survival of patients for different clinical
stage. The log-rank test of early stage patients showed that within stage I and II,
the eight-lncRNA signature could further subdivide them into either a high-risk group
with shorter survival or a low-risk group with longer survival (median OS 2.03 vs.
8.05 years, log-rank test p = 7.81E?09) (Figure 3c). Difference for OS between high-risk group (n = 18) and low-risk group (n = 13)
also was observed for late stage patients (median OS 0.975 vs. 3.367 years) (Figure 3d), although the log-rank p value is 0.253 which was above the 0.05 significance level.

Figure 3. Survival analyses of all patients with available age or tumor stage information using
the eight-lncRNA signature. a Kaplan–Meier survival curves for younger patients with NSCLC (age ?65, n = 337). b Kaplan–Meier survival curves for elder patients with NSCLC (age 65, n = 226). c Survival prediction in early stage patients: Kaplan–Meier survival curves for all
patients with stage I and II (n = 165). d Survival prediction in late stage patients: Kaplan–Meier survival curves for all
patients with stage III and IV (n = 31).

Functional characterization of the eight prognostic lncRNAs

To further investigate the potential biological roles involving the eight prognostic
lncRNAs, the co-expressed relationships between the expression of eight lncRNAs and
those of the protein-coding genes were computed using Pearson correlation coefficients
in the GSE37745 dataset of 196 patients. The expression of 679 protein-coding genes
were highly correlated with that of at least one of the eight signature lncRNAs (Pearson
correlation coefficient 0.40). GO and KEGG pathway function enrichment analysis for
these co-expressed protein-coding genes was then performed, using the whole human
genome as the background. The results showed that four genes (GATA6, CRISPLD2, CFTR2 and CLPTM1L) have been proven to be involved in lung cancer. GO functional annotation suggested
that 679 protein-coding genes were significantly enriched in 28 GO terms (Figure 4a). These significant GO terms were organized into an interaction network with similar
functions using the Enrichment Map 41] plugin in Cytoscape 42]. Several clusters of functionally related GO terms were observed including organ
development and cell proliferation and immune, response to stimulus, catabolic and
metabolic process (Figure 4b). Taken together, these results implied that the eight lncRNAs might be involved
in tumorigenesis through interacting with protein-coding genes that affect the tissue/organ
development and other important biological processes.

Figure 4. Functional enrichment results of the co-expressed protein-coding genes with prognostic
lncRNAs. a Significantly enriched GO terms of the co-expressed protein-coding genes with prognostic
lncRNAs. b The functional enrichment map of GO terms. Each node represents a GO term, which are grouped and annotated by GO similarity. A link represents the overlap of shared genes between connecting GO terms. Node size represents the number of gene in the GO terms. Color intensity is proportional to enrichment significance.