A method for developing regulatory gene set networks to characterize complex biological systems

Interpreting co-membership gene set network and regulatory gene set network

When a pair of gene sets in an M-GSN are connected, the interpretation depends on
gene set data types. For pathway gene sets, an M-GSN represents pathway crosstalk;
whereas for GO gene sets, an M-GSN represents protein moonlighting or gene sharing.
When a pair of gene sets in an R-GSN are connected, they are connected by a directed
edge. A directed edge in an R-GSN presents a possibility of one gene set regulating
another.

According to Table 1, the numbers of unique genes from all gene set collections are not much different,
where Reactome and GO Biological Process contain more genes than the others. The total
count of unique genes from all collections is 11,111. We calculated the proportion
of regulatory edges to nodes and normalized the proportion value by using number of
unique genes. We used , where r_iis the number of gene is sets in collection i; g_iis the number of unique genes in collection i; g is the total number of unique genes (11,111); and s_iis the number of gene sets in collection i, to calculate the normalized proportion. In the R-GSN, GO Biological Process has
the highest normalized proportion of edges to nodes (71.08) among the five gene set
collections. These results indicated that pairs of biological process gene sets are
more likely to have a regulatory relationship. In addition, according to the distribution
of gene set sizes (Additional file 1), the five gene set collections have similar distributions. The distributions show
that the five collections contain more small gene sets (size = 2-20) than large gene
sets (size 20).

Table 1. Summary of co-membership gene set networks and regulatory gene set networks for five
gene set collections.

Additional File 1. Distribution of the sizes of gene sets in each collection. This file contains five histograms presenting distribution of the sizes of gene
sets in each of the five gene set collections.

Format: XLSX
Size: 32KB Download file

Comparing the KEGG regulatory gene set network and the KEGG co-membership gene set
network

In the M-GSN of KEGG, an edge between pathways can be considered as a pathway crosstalk
2,3]. Therefore, M-GSNs can be baselines to compare with R-GSNs. A regulatory relationship
between pathways can provide complementary knowledge such as dysfunction of one pathway
affecting function of other pathways. To investigate this knowledge, R-GSNs and M-GSNs
of the KEGG pathways were compared.

The percentage of shared edges between M-GSNs and R-GSNs is the highest for KEGG Pathway
(1461/(2230+3274-1461) = 36.14%). The low percentage of shared edges indicates that
the R-GSN provides complementary information to the M-GSN. In addition, it is important
to note that R-GSNs are constructed from gene regulations which were collected from
high coverage data sources. Therefore, it is unlikely that R-GSNs depend on the quantity
and the quality of experimental data.

Considering both the R-GSN and M-GSN of KEGG pathway gene sets, the most significant
edge of the KEGG R-GSN is a regulatory relationship from â€œCell cycleâ€ to â€œCytokine-cytokine
receptor interactionâ€ with p-value = 6.46E-75 (Table 2); whereas the co-membership edge between â€œCell cycleâ€ to â€œCytokine-cytokine receptor
interactionâ€ has relatively low significance value ( 0.029). In addition, only 4 of
the top 10 most significant regulatory edges are found in M-GSNs. These findings suggest
that the R-GSN reveals additional knowledge to the M-GSN.

Table 2. Top 10 most significant regulatory edges in the KEGG regulatory gene set network.

For the KEGG R-GSN, 7 of the 10 most significant regulatory edges are from the â€œCell
cycleâ€ gene set to other 7 KEGG pathway gene sets (Table 2). This suggests that changing of â€œCell cycleâ€ pathway is likely to affect other pathways.
This finding is corresponding to the fact that a cell cycle is a complex series of
phenomena by which cellular material is duplicated and divided. Therefore if a cell
cycle pathway does not function appropriately, other pathways such as Pathways in
Cancer can be affected.

For the M-GSN, the KEGG pathway gene sets of Alzheimerâ€™s, Parkinsonâ€™s, and Huntingtonâ€™s
diseases have significant co-membership edges linking them together (Table 3). The three co-membership edges connecting the three neurodegenerative diseases are
among the top 10 most significant co-membership edges suggesting that the three neurodegenerative
diseases are highly related. In addition, 5 of the top 20 co-membership edges connect
cancer related pathway gene sets. These 5 edges connect the â€œPathways in cancerâ€ gene
set with 5 gene sets of cancers including â€œSmall cell lung cancerâ€, â€œPancreatic cancerâ€,
â€œMelanomaâ€, â€œColorectal cancerâ€, and Prostate cancerâ€.

Table 3. Top 10 most significant co-membership edges in the KEGG co-membership gene set network.

In addition, degree centrality (DC) of each node in both networks was calculated (Table
4). In the KEGG M-GSN, the â€œPathways in cancerâ€ gene set has the highest value of DC
(0.39). In the KEGG R-GSN, the â€œCell cycleâ€ gene set has the highest DC (1.96) and
the highest out-degree centrality (0.65). The gene set which has the highest in-degree
centrality (0.41) is â€œPathways in cancerâ€. Using directionality information to calculate
in-degree and out-degree centrality of the KEGG R-GSN, we can find the sink, â€œPathways
in cancerâ€, and the source, â€œCell cycleâ€, gene sets.

Table 4. Top 5 degree centrality pathway of KEGG co-membership gene set network (left) and
regulatory gene set network (right).

After the DC value for each gene set was calculated, the correlation between DC of
the KEGG M-GSN and DC of the KEGG R-GSN was calculated. The correlation coefficient
is 0.84 and R-squared value is 0.70 (Figure 1). This result suggests that the gene set which is important in the M-GSN is likely
to be important in the R-GSN and vice versa. We found three interesting outliers.
Pathway number 1, which has regulatory DC = 1.96 and co-membership DC = 0.12, is â€œCell
cycleâ€ suggesting that the Cell cycle pathway does not tend to share genes, but to
have regulatory relationship with other pathways. Pathway number 2, which has regulatory
DC = 1.47 and co-membership DC = 0.39, is â€œPathways in cancerâ€ suggesting that Pathways
in cancer shared high number of genes with several pathways and their genes also regulate
the unique genes of other pathways. Pathway number 3, which has regulatory DC = 0.68
and co-membership DC = 0.005, is â€œMaturity onset diabetes of the youngâ€. This pathway
shares 6 genes with only â€œType II diabetes mellitus pathwayâ€.

Figure 1. Correlation between DC of the KEGG M-GSN and DC of the KEGG R-GSN. Pathway number 1 is Cell cycle; pathway number 2 is Pathways in cancer; and pathway
number 3 is Maturity onset diabetes of the young.

While the correlation between DC of the KEGG M-GSN and DC of the KEGG R-GSN is as
high as 0.84, the topologies of the networks are different (Figure 2). This suggests that the two networks can be used to explain different phenomenon
in biological systems.

Figure 2. Gene set networks of KEGG gene sets. (M) is the KEGG M-GSN and (R) is the KEGG R-GSN. Node colors represent different
classes of pathways. The networks were drawn by Cytoscape using Cytoscapeâ€™s BioLayout.

Constructing an exclusive R-GSN

We constructed a KEGG exclusive R-GSN (R-GSN minus M-GSN) that contains only exclusive
edges in the R-GSN (Figure 3). The correlation of DC between the KEGG exclusive R-GSN and the KEGG R-GSN is 0.81.
The correlation of DC between the KEGG exclusive R-GSN and the KEGG M-GSN is 0.44,
while the correlation of DC between the KEGG non-exclusive R-GSN and the KEGG M-GSN
is 0.84. These suggest that the exclusive R-GSN can reveal important gene sets that
are not likely to be revealed by M-GSN.

Figure 3. Degree distribution of KEGG exclusive R-GSN. This figure show the degree distribution of KEGG exclusive regulatory gene set network.
Because the regulatory gene set network is a directed network, the degree of each
nodes was counted by summing in-degree and out-degree.

The directionality information from a regulatory gene set network revealed â€œsinkâ€
and â€œsourceâ€ gene sets in addition to â€œhubâ€ gene sets, which can be revealed by constructing.
Table 5 shows the top 10 highest out-degree centrality gene sets (right) and top 10 highest
in-degree centrality gene sets (left) of KEGG exclusive R-GSN. â€œCell cycleâ€, â€œp53
signaling pathwayâ€, and â€œTGF-beta signalling pathwayâ€ are among the top 3 highest
out-degree centrality which can be the sources. The top 3 highest in-degree centrality
gene sets, â€œCell cycleâ€, â€œHematopoietic cell lineageâ€, and â€œCytokine-cytokine receptor
interactionâ€, are the sinks.

Table 5. Top 10 highest out-degree centrality and In-degree centrality of the KEGG exclusive
R-GSN.

Comparison of the KEGG co-enrichment network and the KEGG regulatory network

In the previous analysis, we used the KEGG M-GSN as a baseline and compared it with
the KEGG R-GSN. Several relationships between gene sets were found exclusively in
the R-GSN. In order to validate both the KEGG M-GSN and the KEGG R-GSN, the two GSNs
were compared with the KEGG co-enrichment network (E-GSN) obtained from the study
of Jignesh, et al. 11]. To construct an E-GSN, they integrate experimental gene lists and link two gene
sets if the unique genes of the two gene sets are consistently enriched together across
many experimentally derived gene lists. Therefore, edges found in the KEGG R-GSN should
also be found in the KEGG E-GSN.

The total number of edges found in both the E-GSN and the R-GSN is 1,050, which accounts
for 67.48% of the total number of edges in the E-GSN. The total number of edges found
in both the E-GSN and M-GSN is 914, which is equal to 58.74% of the total number of
edges in the E-GSN. In order to calculate the significance value of the number of
shared edges, the KEGG E-GSN was compared with random networks. We randomly generated
1,000 networks using all the 187 gene sets from KEGG. To calculate the significance
value of the number of shared edges between the KEGG E-GSN and the KEGG M-GSN, each
of the 1,000 random networks contains 2,230 edges, which are equal to the number of
edges found in the KEGG M-GSN. Then Fisherâ€™s exact test was used for calculating the
p-value for the number of shared edges. The p-value is 2.2e-16 (Figure 4A). To calculate the significance value of the number of shared edges between the KEGG
E-GSN and the converted KEGG R-GSN, each of the 1,000 random networks contains 3,274
edges. Then Fisherâ€™s exact test was used for calculating the p-value for the number
of shared edges. The p-value is 2.2e-16 (Figure 4B).

Figure 4. Number of shared edges between the KEGG M-GSN and the KEGG E-GSN and between the KEGG
R-GSN and the KEGG E-GSN. (A) A graph showing number of edges in the KEGG E-GSN that shared with the KEGG
M-GSN and shared with 1,000 random networks. For the number of shared edges between
the 1,000 random network and the KEGG E-GSN, the average is 197; the minimum is 162;
and the maximum is 236. (B) A graph showing number of edges in the KEGG E-GSN that
shared with the KEGG R-GSN and shared with 1,000 random networks. For the number of
shared edges between the 1,000 random networks and the KEGG E-GSN, the average is
289; the minimum is 240; and the maximum is 333.

The number of shared edges between the KEGG E-GSN and the KEGG R-GSN is significantly
high. This is corresponding to the fact that a pair of gene sets with strong regulatory
relationship should be connected with a co-enrichment edge. The number of shared edges
between the KEGG E-GSN and the KEGG M-GSN is also significantly high. This is also
corresponding to the fact that a pair of gene sets with a high number of shared genes
should be connected with a co-enrichment edge.

A disease specific regulatory gene set network

The number of shared genes between the AD gene list (Additional file 2) and each gene set in the global gene set collection was counted. Out of the 2,314
gene sets, 261 have significant number of shared genes. Among the 261 AD gene sets,
42 are from KEGG; 59 are from Reactome; 37 are from GO Molecular Function; 105 are
from Go Biological Process; and 18 are from GO Cellular component. Figure 5 shows the degree distribution of AD-specific R-GSN constructed based on these 261
AD gene sets. In addition, Table 6 shows the top 10 AD related gene sets arranged by the significant values. â€œGenes
involved in Lipid digestion, mobilization, and transportâ€ gene set has the highest
significant value, 2.61E-13. The significant value was calculated by using Formula
(1) where there are 46 genes and 16 genes were found in the AD gene list. The AD related
R-GSN of the top 10 gene sets are presented in Figure 6.

Additional File 2. AD gene list. This file contains a list of AD related gens.

Format: TXT
Size: 3KB Download file

Figure 5. Degree distribution of the AD specific R-GSN. Degree distribution of the AD specific R-GSN containing 261 gene sets and 15,178
regulatory edges.

Table 6. Top 10 AD related gene sets arranged by p-value.

Figure 6. A R-GSN of the top 10 AD related gene sets. A regulatory gene set network of the top 10 AD related gene sets. Node colors represent
different collections of gene sets. The networks were drawn by Cytoscape using Cytoscapeâ€™s
BioLayout.

We then investigated the top 10 degree centrality gene sets. Table 7 shows that â€œsignal transductionâ€ from GO biological process has the highest value
of DC suggesting that the signal transduction process is very important in AD. Searching
on PubMed found more than 2,000 publications discussing the relationship between signal
transduction abnormality and Alzheimerâ€™s disease.

Table 7. Top 10 degree centrality gne sets among 261 AD related gene sets.

Considering the directionality of the AD specific R-GSN, â€œsignal transductionâ€ has
the highest in-DC (0.62). Gene sets which have the highest out-DC (0.54) are â€œcellular
protein metabolic processâ€ and â€œcellular macromolecule metabolic processâ€ (Additional
file 3). These results show that directionality information from the R-GSN enables us to
identify, in AD context, a sink gene set, â€œsignal transductionâ€, and source gene sets,
â€œcellular protein metabolic processâ€ and â€œcellular macromolecule metabolic processâ€.

Additional File 3. Centrality values of each gene set in the AD specific gene set network. This file contains centrality values, including degree centrality, closeness centrality,
and betweenness centrality, of 261 AD related gene sets in the AD specific R-GSN.

Format: XLSX
Size: 55KB Download file

Furthermore, the closeness and betweenness of each gene set in the AD specific R-GSN
were computed. Note that the directionality of each edge in the network was considered
when we calculated the shortest path for the network in order to calculate the closeness
and the betweenness. For the closeness centrality, â€œsignal transductionâ€ from the
GO biological process, â€œPathways in cancerâ€ from KEGG, and â€œprotein metabolic processâ€
from the GO biological process have the highest closeness values (0.190, 0.188, and
0.188, respectively). These results suggest that inappropriate functions of the three
gene sets are likely to affect high number of gene sets in AD. For the betweenness
centrality, â€œPathways in cancerâ€ from KEGG, â€œsystem developmentâ€ from GO biological
process, and â€œLeishmania infectionâ€ from KEGG have the highest betweenness values
(1,225.04, 1168.99, and 1146.79, respectively). These results suggest that the three
gene sets are likely to be on the critical path of biological functioning for AD patients.