Discovering transnosological molecular basis of human brain diseases using biclustering analysis of integrated gene expression data

Identification of brain disease-specific coexpressed gene sets

We aimed to identify gene sets coexpressed either in brain diseases or in normal controls
in order to find molecular mechanisms potentially associated with brain diseases. In
particular, we focused on molecular mechanisms common to multiple brain diseases. We chose
five neurodegenerative and three psychiatric disorders: Alzheimer’s disease (AD),
Parkinson’s disease (PD), Huntington’s disease (HD), multiple sclerosis (MS), amyotrophic
lateral sclerosis (ALS), schizophrenia (SCH), bipolar disorder (BD), and autism (AUT),
along with controls. For an integrative gene expression analysis, we combined three
microarray datasets into a single dataset by adjusting for batch-specific effects using
the ComBat method [15]. As a result, we obtained a microarray dataset with 6,688 distinct
genes and 237 samples. We applied the biclustering method to the combined microarray dataset to
efficiently get initial sets of biclusters in which at least genes show correlated
expression values under an arbitrary subset of samples, with an average Pearson’s correlation
coefficient (PCC) of 0.7 or greater, in a transnosological manner. We chose
0.7 as the threshold because our empirical simulation showed that an arbitrarily
composed bicluster only rarely has an average PCC of 0.7 or greater
(P = 1E-04). We selected only those biclusters containing
all samples from one or more classes (e.g. all samples from the multiple sclerosis
class and all samples from the Alzheimer’s disease class). From each selected bicluster,
we removed the samples of any class that was only partially included. With
the refined biclusters, we next determined whether gene coexpression is substantially
gained or lost in brain diseases compared to normal. To do so, we separately
calculated the average PCC of the genes in each refined bicluster across the samples of
its brain diseases and across the normal samples, and took the difference between the two
average PCCs. We assigned p-values to this difference using a background distribution of
differences in average PCC between brain diseases and normal, obtained from 100,000 random
gene groups. Finally, we identified coexpressed gene sets showing a statistically
significant gain or loss of coexpression in brain diseases.
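
The gain-or-loss test described above can be illustrated with a minimal sketch. It assumes a NumPy expression matrix with genes in rows and samples in columns; the function names (`average_pcc`, `coexpression_change`, `empirical_p_value`), the two-sided comparison, and the add-one correction are illustrative assumptions rather than details taken from the analysis itself.

```python
import numpy as np

def average_pcc(expr):
    """Mean pairwise Pearson correlation among the rows (genes) of expr."""
    corr = np.corrcoef(expr)                 # genes x genes correlation matrix
    upper = np.triu_indices_from(corr, k=1)  # count each gene pair once
    return float(corr[upper].mean())

def coexpression_change(expr, gene_idx, disease_cols, normal_cols):
    """Average PCC in disease samples minus average PCC in normal samples."""
    sub = expr[gene_idx]
    return average_pcc(sub[:, disease_cols]) - average_pcc(sub[:, normal_cols])

def empirical_p_value(expr, gene_idx, disease_cols, normal_cols,
                      n_random=1000, seed=0):
    """Empirical p-value against random gene groups of the same size.

    Assumption: a two-sided test on the absolute difference; the source
    does not state whether gain and loss were tested separately.
    """
    rng = np.random.default_rng(seed)
    observed = coexpression_change(expr, gene_idx, disease_cols, normal_cols)
    background = np.empty(n_random)
    for i in range(n_random):
        random_genes = rng.choice(expr.shape[0], size=len(gene_idx), replace=False)
        background[i] = coexpression_change(expr, random_genes,
                                            disease_cols, normal_cols)
    extreme = np.sum(np.abs(background) >= abs(observed))
    # Add-one correction keeps the estimate strictly positive.
    return observed, (extreme + 1) / (n_random + 1)
```

A positive difference corresponds to a gain of coexpression in disease and a negative one to a loss. With 100,000 random gene groups, as used in the text, the smallest p-value this estimator can report is about 1E-05.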

At a Benjamini-corrected p-value of 0.005, a total of 4,307 gene sets were identified as
coexpressed in at least two brain diseases, implying that a large number of molecular
mechanisms may be commonly associated with multiple brain diseases. Given
that the number of coexpressed gene sets specific to individual brain diseases is
3,409, our finding supports, for the first time, the hypothesis that the investigated
brain diseases are highly similar at the molecular level. Figure 1 shows the numbers
of shared coexpressed gene sets, which indicate the degree of association among different
combinations of brain diseases. We ranked the similarity between brain diseases by the
total number of coexpressed gene sets shared by each possible pair of brain diseases.
ALS and MS are the most similar pair, with 1,684 shared coexpressed gene sets. The other
brain diseases follow in the order AD, SCH, PD, HD, AUT, and BD by number of shared gene sets.
This result points to shared molecular-level mechanisms among neurodegenerative diseases.
Notably, SCH, although classified as a psychiatric disorder, was highly correlated with
the neurodegenerative diseases.
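
The pairwise ranking can be sketched as a simple count over gene sets. The example assumes each gene set is annotated with the set of diseases in which it is coexpressed; the function name `pairwise_shared_counts` and the toy input are hypothetical and do not reproduce the reported numbers.

```python
from collections import Counter
from itertools import combinations

def pairwise_shared_counts(gene_set_diseases):
    """Count, for every disease pair, the gene sets coexpressed in both."""
    counts = Counter()
    for diseases in gene_set_diseases:
        # Sorting gives each pair a single canonical key, e.g. ("ALS", "MS").
        for pair in combinations(sorted(set(diseases)), 2):
            counts[pair] += 1
    return counts

# Toy input (hypothetical): diseases in which each gene set is coexpressed.
toy = [{"ALS", "MS"}, {"ALS", "MS", "AD"}, {"AD", "SCH"}, {"PD"}]
counts = pairwise_shared_counts(toy)
ranked = counts.most_common()  # disease pairs ordered by shared gene sets
```

A gene set found in a single disease contributes to no pair, so only shared gene sets affect the ranking.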

Figure 1. Summary of brain disease combinations sharing common coexpressed gene sets.

Figure 2 shows the top 30 genes most frequently found in the gene sets of a single brain
disease and in the gene sets shared by at least two brain diseases. To discover the
biological meaning and functions of the genes from the single and multiple brain disease
gene sets, function enrichment analysis of each gene list was performed using DAVID. As a
result, fourteen genes (RTN3, LANCL1,
YWHAQ, COX5B, DYNC1H1, HNRNPD, MIF, NEDD8, PRKAR1A, PSMC1, RPL31, TBCA, VDAC3, and
CDC42) from the gene sets of one brain disease were significantly enriched only for the
‘acetylation’ term of SwissProt PIR Keywords, with a Benjamini-corrected p-value of 0.006.
Three genes (AP2A2, CACNG2, and RAB3A) from the gene sets of at least two brain diseases
were significantly enriched for the ‘synaptic transmission’ term of the Reactome pathways,
with a Benjamini-corrected p-value of 0.022. This result shows that the genes frequently
found in gene sets shared across brain diseases differ from those of single-disease gene
sets in terms of cellular function. In particular, the genes frequently observed in gene
sets shared by multiple brain diseases are distributed in important pathways of
neurological processes.
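
For readers unfamiliar with enrichment statistics, the following sketch shows the general idea under stated assumptions. It uses a plain one-sided hypergeometric test, which is a stand-in and not DAVID's own modified Fisher exact (EASE) score, and it assumes that "Benjamini corrected" refers to the Benjamini-Hochberg procedure. Both function names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): n genes drawn from N background genes, K of which carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each annotation term is tested once per gene list, and the resulting p-values are adjusted together before a cutoff is applied.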

Figure 2. Top 30 genes most frequently found in the coexpressed gene sets. (A) Gene sets shared by at least two brain diseases and (B) single brain disease-specific
gene sets.

Coexpressed gene sets across multiple brain diseases are enriched for known disease-associated
genes

We found 1 coexpressed gene set shared by all brain diseases and 7 coexpressed gene sets
shared by 7 different brain diseases (Table 1). To find associations between the genes in
these 8 coexpressed sets and brain diseases, we compared the genes in the coexpressed gene
sets with known brain disease-associated genes, both directly and through first-order
interacting proteins. First, we collected brain disease-associated
genes from publicly available disease databases. We used the 2,697 collected genes, comprising
1,310 for AD, 985 for SCH, 534 for MS, 517 for BD, 352 for PD, 186 for AUT and 44
for HD. Second, we used our comprehensive protein interaction database, ComBiCom
[16], to find nonredundant protein interaction relationships. As a result, three
genes (ATP1A1 for BD, RBFOX1 for BD, and KLC1 for AD and PD) are directly related
to brain diseases. Moreover, 10 genes (ATP1A1, ATP6V1D, C16orf45, CROCC, KLC1, LUC7L2,
PIAS2, PRKAR1B, RBFOX1, and RPL19) interact with at least one brain disease-associated
gene. These genes are also associated with brain functions, biological processes or
pathways, such as hormone synthesis, metabolic pathway, cell death, synaptic vesicle cycle,
and nervous system development. Our approach therefore offers the potential to discover
more reliable and accurate drug targets covering multiple brain diseases with shared
molecular mechanisms.
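
The direct and first-order lookup can be sketched with a small adjacency map. The example assumes the interaction data are available as pairs of gene symbols; `build_neighbors`, `disease_links`, and the placeholder gene names are hypothetical, and nothing here reflects the actual ComBiCom interface.

```python
def build_neighbors(interactions):
    """Undirected adjacency map from (a, b) pairs; duplicate pairs are merged."""
    neighbors = {}
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def disease_links(gene_set, disease_genes, neighbors):
    """Return (direct hits, genes with a disease-associated first-order interactor)."""
    direct = {g for g in gene_set if g in disease_genes}
    first_order = {g for g in gene_set
                   if neighbors.get(g, set()) & disease_genes}
    return direct, first_order

# Toy data with placeholder symbols (not real annotations).
interactions = [("GENE_A", "GENE_X"), ("GENE_X", "GENE_A"), ("GENE_B", "GENE_Y")]
neighbors = build_neighbors(interactions)
direct, first_order = disease_links({"GENE_A", "GENE_B", "GENE_C"},
                                    {"GENE_A", "GENE_X"}, neighbors)
```

A gene can appear in both result sets, which matches the text, where ATP1A1, RBFOX1 and KLC1 are listed both as directly related genes and among the interacting genes.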

Table 1. Coexpressed gene sets shared by at least 7 brain diseases.

Functional characteristics of coexpressed gene sets shared by multiple brain diseases

As observed above, there may be functional differences between the single
brain disease-specific gene sets and the shared gene sets. We further investigated the
functional characteristics of the 4,299 shared gene sets in comparison with the 3,409
disease-specific gene sets. For this, we carried out a function enrichment analysis
for each shared gene set and each single disease-specific gene set using the Gene Ontology
(GO) biological process at a Benjamini-corrected p-value of 0.01. We assigned each gene
set to one or more of the representative biological processes according to its enriched
biological processes, in order to identify the functional categories that are relatively
overrepresented in the multiple brain disease gene sets compared to the single
disease-specific gene sets. For a fair comparison, we normalized the number of assigned
gene sets by dividing the number of identified gene sets in each functional category by
the total number of identified gene sets. Figure 3A and Figure 3B show the 10 functional
categories with the highest counts in the multiple brain disease and single brain
disease-specific gene sets. In Figure 3C, we show only the 20 functional categories with
the highest and the lowest fold differences.
While the single brain disease-specific gene sets were more frequently enriched for
the metabolic processes, the functional categories such as “cell cycle”, “neurological
system process” and “cell-cell signaling” had nearly 2-fold greater representation
among the shared gene sets. This is notable since the category “neurological system
process” involves a wide variety of biological processes that directly regulate or
at least substantially affect neurological processes at the phenotypic level. The
overrepresentation of “cell morphogenesis”, “cell death”, “developmental maturation”,
and “cell-cell signaling” indicates that the shared gene sets are more likely to
have implications for brain cell development and degradation, and neuron-to-neuron
or glia-to-neuron interaction, respectively. Taken together, our data suggest that
the shared molecular bases among multiple brain diseases are more enriched in the
functional categories that are associated with neurological function. This might reflect
the fact that the multiple brain diseases in our study have great similarity in terms
of neurological deficit even though the pathology or other symptoms vary greatly from
disease to disease. Thus, identifying the shared gene sets rather than single disease-specific
gene sets might increase the chances of discovering more extensive and plausible molecular
bases that are tightly associated with neurological impairment.
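
The normalization and fold comparison described in this section can be sketched as follows. The category counts and totals are made-up toy values, and the function names are hypothetical; the only step taken from the text is dividing each category count by the total number of gene sets in its group before comparing groups.

```python
def normalized_fractions(category_counts, total_sets):
    """Fraction of gene sets assigned to each functional category."""
    return {cat: n / total_sets for cat, n in category_counts.items()}

def fold_ratios(shared_counts, shared_total, single_counts, single_total):
    """Shared-to-single ratio of normalized fractions per category."""
    shared = normalized_fractions(shared_counts, shared_total)
    single = normalized_fractions(single_counts, single_total)
    # Categories absent from the single-disease group are skipped to avoid
    # division by zero (an assumption; the source does not discuss this case).
    return {cat: shared[cat] / single[cat]
            for cat in shared if single.get(cat, 0) > 0}

# Toy counts (hypothetical), not the values behind Figure 3.
ratios = fold_ratios({"cell cycle": 40, "metabolic process": 10}, 100,
                     {"cell cycle": 10, "metabolic process": 20}, 50)
```

A ratio above 1 marks a category that is relatively overrepresented among the shared gene sets, and a ratio below 1 marks one that is more typical of single disease-specific gene sets.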

Figure 3. Functional distribution of the shared gene sets and the single brain disease-specific
gene sets. The x-axis shows the representative functional categories (biological processes)
selected. (A)(B) The number of assigned gene sets shared by multiple brain diseases
and single brain disease-specific gene sets in each functional category. (C) Ratio of the
number of shared gene sets over single disease-specific gene sets.