The molecular landscape of premenopausal breast cancer

Differences in gene expression comparing preM and postM breast cancer

Here we set out to understand differences in molecular make-up between preM and postM
breast cancer, starting with the analysis of differential gene expression (DE). As
a first step, we performed principal component analysis (PCA) of RNA-seq data (Fig. 1a) and Agilent microarray data (Additional file 13: Figure S2A) from normal and breast cancer samples in TCGA. Normal and tumor samples
were separated by the first principal component (PC), while ER+ and ER? tumor samples
were separated by the second PC (Fig. 1b, Additional file 13: Figure S2A). PreM and postM patients were not separated by the first two PCs. Further,
the third PC did not separate preM and postM patients (data not shown). Similar results
were obtained when using methylation data (Additional file 13: Figure S2B).

Fig. 1. Differences in outcome and gene expression comparing premenopausal (preM) and postmenopausal (postM) tumors. a Principal component analysis (PCA) using RNA seq data from The Cancer Genome Atlas
(TCGA). Tumor: purple triangle, estrogen receptor-negative (ER–)/preM; red diamond, ER–/postM; orange star, estrogen receptor-positive (ER+)/preM; pink square, ER+/postM. Normal: light blue triangle, ER–/preM; black diamond, ER–/postM; green star, ER+/preM; dark blue square, ER+/postM. b Heatmap showing genes differentially expressed (n = 1,609) between ER+/preM and.
ER+/postM, using RNA-Seq data. Each gene is normalized to standard normal distribution
with green indicating lower expression, and red indicating higher expression. PC principal component

We used DE analysis to compare gene expression data in preM and postM tumors, stratified
by ER status. Analysis was performed using RNA-seq and Agilent microarray data with
a focus on RNA-seq data due to the larger number of samples. RNA-seq data were available
on 612 preM and postM ER+ tumors and ER– tumors (ER+, number of preM:number of postM
= 109:372, and ER–, number of preM:number of postM = 37:94). For ER+ tumors, we found
a total of 1,609 and 564 DE genes in RNA-seq data (Fig. 1b) and Agilent array data (Additional file 13: Figure S3A), respectively (gene lists are provided in Additional file 2). The overlap (Additional file 13: Figure S3B) of 301 genes between RNA-seq and Agilent microarray data was statistically
significant (p 0.0001). Adding a restriction to fold change (1.5 or 0.5) decreased the number
of DE genes to 178 (89.3 % overexpressed in preM) in RNA-Seq data. The top overexpressed
genes in preM tumors were AREG, TFPI2, AMPH, DBX2, RP5-1054A22.3, and KLK5, and the genes with significantly lower expression in preM were ESR1, CYP4Z1, and RANBP3L, FOXD2, and PEX3. In contrast to ER+ tumors, no DE genes were detected when comparing preM
and postM ER– tumors.

Because of the influence of sample size on DE analysis, and the difference in the
size of the preM and postM groups, we conducted an analysis using a subsample of ER+
preM and postM tumors of the same size as in the ER– tumors. We again detected a statistically
significant number of genes DE between preM and postM in ER+ but not in ER– tumors
(data not shown). During the subsampling (n = 100), we identified 28 genes that were
consistently detected in more than 10 subsample tests, and 2 genes (AREG and ESR1)
were detected as DE in more than 50 subsample tests. Due to the lack of significant
differences between ER– preM and postM tumors, our subsequent analyses focused on
ER+ disease.

Differences in methylation, CNV, somatic mutation and proteomics comparing preM and
postM ER+ breast cancer

To determine significant differences in methylation between preM and postM ER+ breast
tumors, we analyzed methylation data from TCGA. A total of 1,738 probes (mapping to
818 unique genes) were differentially methylated after setting the FDR at 5 % and
constraining the absolute difference of the average beta value within the preM and
postM groups to be 0.1 (Fig. 2a). Among them, 48 % of the probes (373 genes) were hypo-methylated in preM tumors,
while 52 % probes (457 genes) were hyper-methylated in preM tumors. ESR1, MAT2B, CTSS, DDR2 and GALNTL2 were the top genes hyper-methylated in preM tumors relative to postM tumors. RPL3, FBXL16, RASGEF1A, KLF6 and MCM7 were the top genes hyper-methylated in postM ER+ breast cancer (genes lists are provided
in Additional file 2).

Fig. 2. Differences in methylation and copy number variation (CNV) between premenopausal (preM) and postmenopausal (postM) estrogen receptor-positive (ER+) tumors: a Heatmap representing significant differences in methylation of 1,738 probes between
preM (n = 75) and postM (n = 233) tumors. Red and green indicate higher and lower levels of methylation, respectively. b Heatmap representing significant changes in CNV at gene level between preM and postM
ER+ tumors (772 genes). White indicates diploid normal copy, gray indicates single copy deletion (LOH), blue indicates homozygous deletion, yellow indicates low-level copy number amplification, and red indicates high-level copy number amplification. c Peaks of significant amplification (top panels) and deletion (bottom panels) in preM and postM ER+ tumors. Peaks were identified by GISTIC 2.0. The x-axis represents
the G score (top) and Q value (bottom). The vertical green line represents the significance cutoff q value = 0.25. Stars indicate significantly different regions of amplifications or
deletions comparing preM and postM ER+ tumors. d Mutation distribution in the top five differentially mutated genes (CDH1, GATA3, MLL3/KMT2C, GPS2, PIK3CA). Red, black, and gray indicate truncating, in-frame and other mutations, respectively. Figures were generated
with cBio MutationMapper 49], 50]

We performed a gene-level comparison of CNV between preM and postM ER+ tumors: 772
genes had different CNV, with SAFB2, TNFSF9, C19ORF70, and HSD having the most significant changes (Fig. 2b) (gene lists are provided in Additional file 2). We also performed GISTIC to identify regions of frequent and significant aberrations
in ER+ preM and postM tumors (Fig. 2c). GISTIC identified 26 significant amplifications and 33 deletions in ER+ postM tumors,
and 18 significant amplifications and 39 deletions in ER+ preM tumors (Additional
file 3). Inspection of the GISTIC score file in Integrated Genomics Viewer (IGV) identified
nine regions of deletion (10q26.3, 10q23.31, 11q13.1, 12q24.21, 4q34.3, 6q15, 7p22.3,
7q36.1, 8p21.3) and six regions of amplification (13q12.3, 16q12.1, 4q13.3, 6q25.1,
6q12, 8q13.1) that were unique to ER+ postM tumors. Three regions of deletion (19q13.32,
22q12.3, 22q13.31) and two regions of amplification (6p23, 7p21.1) were uniquely identified
in ER+ preM tumors.

To identify genes differentially mutated in preM and postM tumors, we applied the
MutSig analysis algorithm. Genes significantly mutated in preM or postM (q 0.05) are shown in Additional file 2. Five genes had statistically significantly different mutation rates between preM
and postM tumors: CDH1 (14.5 % vs 2.9 %), GATA3 (9.4 % vs 18.6 %), MLL3 (13.3 % vs 4.9 %), GPS2 (0.8 % vs 3.9 %), and PI3KCA (48.7 % vs 37.3 %). The mutation distributions are shown on a diagram of the domain
structures of the respective proteins (Fig. 2d). After correction for multiple comparisons only one gene, E-cadherin (CDH1) remained differentially mutated between the preM and postM groups. This finding
is in agreement with E-cadherin being frequently mutated in invasive lobular carcinoma
(ILC), which are more common in older patients. It is possible that additional genes
have significantly different mutation rates in preM and postM tumors, but that we
failed to detect those due to limited power. Given our sample number of 102 ER+ preM
and 384 ER+ postM tumors used for the MutSig analysis, and assuming mutation rates
of 0 % and 2.5 %, or 2.5 % and 10 %, we had 81 % and 83 % power to detect those (detailed
information on power analysis for our study is shown in Additional file 2).

We determined overall base pair mutation rates, and found significantly increased
rates of mutations in postM compared to preM breast cancer (0.99 vs 0.67 mutation/Mb;
p 0.0001). We asked whether there was a difference in the mutations spectra of ER+
preM and postM tumors and observed an increase in CT mutations in postM tumors (Additional
file 13: Figure S4A) translating to an increase in transitions (Additional file 13: Figure S4B) (preM 43 %, postM 54 %; p 0.0001 chi-square test). Further examination of the trinucleotide mutation spectra
showed that the increased CT mutations in postM cancer was limited primarily to the
context of a 5? T and 3?G (TCGTTG) (Additional file 13: Figure S4C). Additionally, postM cancers were enriched for mutations within TCW
motifs that are associated with APOBEC-induced changes (preM 27 %, postM 32 %; p 0.001 chi-square test), and this increase was limited to the context of TCTTAT
(Additional file 13: Figure S4D).

To compare protein expression and posttranslational modifications between preM and
postM ER+ tumors we studied RPPA data which were available for 142 proteins in 233
ER+ tumors at the time of our data lock. The only significant difference detected
was increased expression of ER? in postM ER+ tumors (Additional file 13: Figure S5 and Additional file 2). Phosphorylation of ER? at Ser118 ranked third and was significant at a nominal
level (p value 0.0016, adjusted p value 0.17).

Identification of signaling pathways enriched in preM ER+ tumors

IPA was used to identify active pathways in preM tumors. We limited the analysis to
genes which (1) were differentially expressed between preM and postM ER+ tumors, and
(2) were significantly different between preM ER+ tumors and normal tissue using the
union of genes identified from both RNA-seq and Agilent microarray data (Additional
file 4). IPA analysis revealed integrin signaling as the most significant canonical pathway
altered when comparing ER+ preM and ER+ postM tumors (Additional file 5). Pathway enrichment is shown for both RNA-seq and Agilent microarray data analyzed
individually (Additional file 13: Figure S6) which showed strong concordance. A potential role for integrin signaling
is supported by DAVID functional analysis, which identified focal adhesion as the
most significantly altered pathway in preM and postM ER+ tumors (Additional file 13: Figure S7). IPA analysis identified the transforming growth factor (TGF)? pathway
as the most significant upstream regulator of the transcriptional network in preM
ER+ breast cancer (Additional file 6).

In order to integrate the information-rich data from multiple omics platforms we implemented
PARADIGM analysis incorporating gene expression, copy number, and methylation changes
to infer pathway-level changes. Out of 8,674 entities in the database, 1,026 were
significantly different in preM compared to postM tumors (p 0.05), with a number of integrin and laminin pathways represented among the top
50 entities (Additional file 13: Figure S8). GSEA 16] on PARADIGM predictions identified enriched superPathways in preM tumors; 65 superPathways
were identified as being statistically significant (Fig. 3, Additional file 7), with integrin signaling being the only pathway identified multiple times among
them (?6?1 and ?6?4 integrin signaling; ?1 integrin cell surface interactions; integrin
signaling pathway; agrin in postsynaptic differentiation; ?5 ?6 ?7 and ?8 integrin
cell; surface interactions; ?6?4 integrin-ligand interactions; integrins in angiogenesis;
integrin cell surface interactions). A two-sample t test for 19 integrin/laminin genes showed most significant enrichment of laminins
A1, B1, B2, C1, C2, and C3, and integin ?4 and ?1 (Additional file 13: Figure S9). In addition to the integrin pathway, we took note of the epidermal growth
factor receptor (EGFR) pathway, which was significantly activated in preM disease,
along with its ligand amphiregulin (AREG) (blue stars in Fig. 3).

Fig. 3. Pathways activated in premenopausal (preM) breast tumors. Top 50 entities in superPathway
analysis using Agilent array, copy number variant (CNV) and mutation analysis of The
Cancer Genome Atlas (TCGA) data (preM, n = 65 and postM, n = 239). Green indicates low activity score and red indicates high activity score. Red stars refer to pathways related to integrin/laminin signaling, and blue stars label EGFR and AREG signaling

Validation using METABRIC

To validate our findings from TCGA data, we conducted an independent analysis using
the METABRIC dataset, containing 130 preM and 1,113 postM ER+ tumors, and 121 preM
and 227 postM ER? tumors. An unbiased stratified analysis of gene expression data
detected 2,542 differentially expressed genes in ER+ tumors, among which 1,322 genes
(52.0 %) were over-expressed in preM ER+ samples (Additional file 8). The same analysis detected only 146 differentially expressed genes in ER? samples,
confirming our finding from TCGA data of fewer differences between preM and postM
gene expression in ER? tumors. There was a significant overlap between the differentially
expressed genes in TCGA and METABRIC; 31 % and 36 % of genes over-expressed in ER+
preM and postM in TCGA, respectively, were also found to be differentially expressed
in METABRIC (p 0.0001).

We next asked whether integrin signaling was among the top differentially activated
pathways comparing preM and postM tumors in METABRIC. We applied IPA and confirmed
TGFB1 as the most significantly activated upstream regulator (Additional file 9), and integrin signaling among the top 20 canonical pathways. PARADIGM analysis also
identified a number of integrin and agrin signaling among the top 50 differentially
activated pathways using GSEA (Additional file 10) (?1 integrin cell surface interactions; ?6?4 integrin ligand interactions; integrin
cell surface interactions). Also, as in TCGA, EGFR signaling was found to be significantly
more active in preM ER+ disease.

Sub-clusters within ER+ preM tumors with poor outcome

Given the striking molecular differences between ER+ preM and postM tumors (Figs. 1 and 2), we applied several unsupervised and semi-supervised clustering methods to analyze
ER+ PreM tumors. Unsupervised hierarchical clustering of ER+ preM tumors from both
Agilent and RNA-Seq platforms (Additional file 13: Figure S10) using the top 2,500 differentially expressed genes showed two or more
distinct patterns of expression amongst luminal (Lum)A, LumB and human epidermal growth
factor receptor 2 (Her2)-like tumors. A small fraction of tumors were classified as
basal-like by PAM50, and they clustered together. Sparse k-means clustering of ER+ preM TCGA tumor samples with 1,117 selected genes (Additional
file 11) identified three clusters. Clusters 1 (n
1
= 36) and 2 (n
2
= 29) were mixtures of LumA, LumB and Her2 subtype samples, while cluster 3 (n
3
= 4) was limited to four basal-like tumors (Fig. 4a). Data were available in METABRIC for 976 genes out of the 1,117 genes detected in
TCGA expression (Additional file 11) and we thus used these genes (n = 976), and the same number of clusters (n = 3)
for routine k-means clustering on ER+ preM tumors in METABRIC (Fig. 4b). As observed in TCGA data (Fig. 4a), such clustering does not simply group the tumors based on the different molecular
subtypes as defined by PAM50. Importantly, the three clusters showed very distinct
survival in METABRIC (log-rank test p value 2.6
E-5
, Fig. 4c), with patients whose tumors were in cluster 1 showing extremely good outcome, and
patients whose tumors were in cluster 2 having poor outcomes. The LumA tumors in clusters
1 (n = 33) and 3 (n = 18) had obvious differences in gene expression patterns, and
to specifically compare these two subgroups, we examined the difference between them
for 6,030 probes that had the largest variation (Additional file 13: Figure S1). After BH adjustment of p values, 28 of them remained significant (adjusted p value 0.05 and effect size 1 or ?1, Additional file 11). The log-rank test comparing survival between the two LumA groups was significant
with a p value of 0.003 (Additional file 13: Figure S10).

Fig. 4. Sub-clusters within premenopausal (preM) estrogen receptor-positive (ER+) tumors. a Clustering result of The Cancer Genome Atlas (TCGA) gene expression data by sparse k-means yielded three clusters (n
1
= 36, n
2
= 29, n
3
= 4). b Sparse k-means clustering in Molecular Taxonomy of Breast Cancer International Consortium
(METABRIC) preM ER+ tumors using genes selected by TCGA (n
1
= 56, n
2
= 19, n
3
= 43). c Kaplan-Meier survival curve using clustering result from sparse k-means analysis from METABRIC data. d Semi-supervised clustering of METABRIC gene expression data from preM ER+ tumor.
e Kaplan-Meier survival curve of three groups from semi-supervised clustering of gene
expression from d. f Kaplan-Meier survival curves of groups resulting from four different clustering algorithms,
with the p value from the log-rank test. Lum luminal, Her2 human epidermal growth factor receptor 2

We next performed semi-supervised machine learning of ER+ preM tumors using the METABRIC
dataset: 225 genes were selected following a constraint of absolute effect size 1
and the FDR at 5 %, and k-means clustering with k = 3 was performed (Additional file 11, Fig. 4d). The three clusters showed distinct survival rates, with patients whose tumors fell
into cluster 3 having extremely poor survival (log-rank test p = 3
e-11
, Fig. 4e).

Finally, we asked how our clustering approach compared to two widely used multigene
assays, OncotypeDx and breast cancer index (BCI) 19], 20] in stratifying risk of recurrence. To avoid overfitting of the data, a 10-fold cross-validation
approach was applied. As both Oncotype Dx and BCI classify tumors into three groups
(high, intermediate, and low risk) we selected k = 3 in all algorithms. Similarly, as OncotypeDx uses 21 genes, we set the number
of genes selected in sparse k-means and semi-supervised clustering to 21. In sparse k-means clustering, the 21 genes were selected according to the weight (largest weight),
and in semi-supervised clustering, the 21 genes were selected based on the largest
Cox regression analysis score (Additional file 11). Unsupervised sparse k-means was the least successful in identifying different outcomes (log rank p value 0.0481) (Fig. 4f). Our semi-supervised approach (p = 0.000223) and BCI (0.000243) identified three different risk groups, with equally
good ability to predict survival. Similar results were obtained for Oncotype Dx that
identified three different risk groups (p = 0.0183). Additional studies need to be performed using large external datasets
in order to validate our clustering approach, and its use to predict outcomes in different
preM subgroups.