Transcription factor and microRNA-regulated network motifs for cancer and signal transduction networks

The cancer networks and STNs information used in this study are downloaded from KEGG
(July 2013 version). KEGG integrates genomic, chemical, and systemic functional information
to compose a biological database resource.

Outline of workflow

In the last few years, many biochemical pathways information are released by the KEGG
database 37], which are prepared in the XML format. Now, KEGG provides very detail regulatory
information among the molecules. For example, KEGG delivers the following information
on; 1) PPI (PPrel) including both of the activation and inhibition events, 2) gene
expression interactions (GErel) including expression and depression events, 3) post-translational
modification (PTM, i.e. PPrel with activation or inhibit phosphorylation), and 4)
protein-compound interactions (PCrel with activation or inhibition).

A total of 20 cancer networks and 24 STNs have been processed. Given the regulatory
relationships between two genetic components, one can reconstruct network motifs using
the graph theory approach. The present study addresses the following issues;

(i) collect highly confident regulatory relations from cancer networks and STNs,

(ii) analyze the abundance of five common types of network motifs,

(iii) merge interconnected motif types to form CMS,

(iv) perform gene set enrichment analysis for CMS,

(v) construct TMMN,

(vi) perform text mining to validate the motif results, and

(vii) quantify crosstalking between cancer networks and STNs.

Identifying major types of network motifs

There are a number of publicly available network motif detection tools, namely MFINDER
13], MAVISTO 41], FANMOD 42], NetMatch 43], and SNAVI 44]. The main disadvantage of using MFINDER and MAVISTO for network motif detection is
that they are comparably slow and scale poorly as the subgraph size increases 22,42]. We have performed a trial study using FANMOD with KEGG data as input, the tool reports
subgraphs that occur significantly more often than in random networks. The tool does
not provide information on; 1) how many subgraphs are found, and 2) subgraph’s nodes
identities. In other words, no detail of real motif is supplied. For instance, the
output file of FANMOD reports certain motif information, such as frequency of occurrence,
Z-value and p-value, however, it does not report nodes identities, then one does not know which
genetic elements belong to the motif. In other words, given the pairwise information
as input, FANMOD can predict over-represented motifs with certain level of accuracy,
but it does not report nodes identities.

Also, FANMOD has certain limitation, for instance, it cannot identify motifs with
size one and two, i.e. auto-regulation loop and feedback loop. This can be done with
the adjacency matrix description. More details are given in the ‘Results’ section
Table 2.

Table 2. A comparison of motif finding by the adjacency matrix approach and FANMOD

Because of this limitation, we have developed a motif searching algorithm, which is
able to process KEGG networks, such as; the ‘pathways in cancer (overview)’ for human,
and found a cFFL that involves genes PKC, Ras and Raf. It is interesting to note that
this loop participates in coordination of crosstalk between the Ras/Raf and PKC pathways
23,45].

We also tested our motif-searching algorithm for the plant pathogen interaction network,
and found two FFLs, where the first FFL involves CNGCs with Ca++, CDPK and Rboh, and the second FFL involves MEKK1, MKK1 and MPK4. It is known that
the first FFL is associated with Ca++ signaling 46] whereas the second FFL that involves MEKK1, MKK1 and MPK4 is associated with plant
immune responses 47-49]. This demonstrates the usefulness of identifying or matching network motifs with
functional biological modules.

In the graph theory approach, each bio-molecule is represented as a node and regulatory
relation as an edge. One constructs an adjacency matrix to represent the network.
In the adjacency matrix a value of one and infinity (for convenient a very large number
is used in programming) is assigned to represent direct regulation and non-regulating
nodes respectively. For node that is interacting with itself a value of one is assigned.
Row and column indices denote the upstream and downstream node respectively. Below
we briefly described how to perform the motif search.

ARL

This motif type involves a self-regulated gene. Non-zero diagonal elements in the
adjacency matrix represent this type of motif. The time complexity is O(n).

FBL

This motif type involves two genes regulate each other. For any location(i, j) in the adjacency matrix, if the term of (i,j) is ‘1’ and that of (j,i) is also ‘1’, then genes i and j form a FBL. Since there are C(n,2) combinations to be tested, the time complexity is O(n2).

FFL

This motif type involves three genes regulating each other. Depending on the activation
or suppression order, this motif type can be further divided into the so-called cFFL,
and iFFL.

For any triple set (i,j,k), if the terms of (i,j), (j,i),(i,k),(k,i), (j,k),(k,j) are all of ‘1’, then genes i , j and k form a FFL. Since there are C(n,3)*6 combinations to be tested, the time complexity is O(n3).

Bi-fan

Bi-fan motif denotes a topology where two genes regulate the same other two genes.

Select any two rows in the adjacency matrix which have the value of ‘1’ appear at
the same column more than one time. Check whether these two rows are connected, if
not, then determine which two columns have the value of ‘1’ in both rows. The time
complexity is O(n3). To identify all bi-fan motifs, there are C(n,2)*C(n-2,2) combinations to check,
so the time complexity is O (n4).

SIM

SIM motif denotes a topology where a master gene regulates multiple downstream genes.

Select any row in the adjacency matrix and count how many ‘1’ appear in the row. Since
there are at most n ‘1’s in a row and n rows to search; therefore, the time complexity is O(n2).

Coupled motif structures (CMS)

Some of the network motifs are interconnected which lead to observed phenotypic change.
The present study identifies possible CMS for cancer networks. As a preliminary study,
the following six types of CMS are considered; i.e. FBL-FBL, FFL-FFL, bi-fan bi-fan,
FBL-FFL, FBL-bi-fan and FFL-bi-fan. To obtain such structures, gene names of; 1) same
motif type, and 2) different motif types, are pairwise compared. Given the CMS, it
enables reconstructing the global architecture of the whole network from a bottom-up
approach. More complex CMS are also identified, which can be visualized in our web
platform.

The following pseudo-code was designed to identify the six types of CMS.

Input: The network A with n nodes and all basic network motifs (ARL, FBL, FFL, Bi-fan and SIM) of A.

Output: All CMS of network A

Begin

   For i = 1 to n do

      Loop

         If any two network motifs or CMS which include common node i could be

         merged to form a meaningful CMS, then merging these two subgraphs to form

         larger CMS;

      Until no more basic network motifs or CMS including node i could be merged;

   End of For loop

End

A complex network may have underlying topological structures, which can be characterized
by certain topological parameters. We applied the SBEToolbox 50] to compute several topological parameters, i.e. size, maximum degree, bridging centrality
(BRC) and degree centrality (DC), for the CMS.

Size of the network is given by the largest connected cluster. Maximum degree of a
node is node with the highest number of connections.

The bridging coefficient of a node i is defined by:

(1)

where d(i) is the degree of node i, and N(i) is the set of neighbors of node i. Bridging centrality BRC(i) for node i is defined by

The betweenness centrality BC(i) of a node i is computed as follows:

(2)

where s and t are nodes in the network different from i, ?st denotes the number of shortest paths from s to t, and ?st (i) is the number of shortest paths from s to t that pass through i. BRC is the average of BRC(i) over all i.

Degree centrality of a node i, DC(i), denotes the node degree of node i. The DC of node i in a network is defined by:

(3)

where N denotes the total number of nodes in the network and Aij is the corresponding entry value in the adjacency matrix A. DC is the average of DC(i) over all i.

MiRNA-regulated network motifs

It is known that miRNA plays a crucial role in controlling gene expression and biological
process through its interaction with network motifs. For instance, hsa-miR-15a involves
in cell cycle progression 32] through its interaction with the FFL. In particular, we are interested in miRNA target
genes that are related to cancer formation, i.e. OCG and TSG.

Most miRNAs show reduced expression during cancer formation; while some are overexpressed
in cancers. MiR-155 and its host gene, B-cell integration cluster (BIC), are highly
expressed due to MYB regulates BIC in chronic lymphocytic leukemia 51]. Another example is the miR-17-92 cluster, which is activated by the OCG c-Myc and
is highly expressed in B-cell lymphoma. Members of the miR-17-92 cluster (miR-19a
and miR-19b) are essential to mediate the oncogenic activity of the entire cluster
by down-regulated the expression of the TSG, Pten 52]. These studies indicate that some miRNAs may act as OCGs and involve in the initiation
and progression of cancers.

Cancer gene data are obtained from the Tumor Associated Gene (TAG) database 53], Memorial Sloan-Kettering Cancer Center (MSKCC) 54] and National Yang Ming University, Taiwan 55]. After removing overlapped information among the three datasets, we have collected
a total of 659 OCGs, 1023 TAGs and 151 cancer-related genes. MiRNA target gene information
are obtained from miRTarBase (version 4.5) 56] and TarBase (version 5) 57].

To construct TMMN, the TF-regulated miRNA data are retrieved from Chipbase 58]. Since miRNA target genes information are known; then, by matching the cancer motifs
or CMS results, we obtained cancer-specific TMMN. In addition, we labeled target genes
as OCGs or TSGs if they can be found in our cancer gene set collection.

Gene set enrichment analysis

Functional annotation of dense PPI module is given by the Database for Annotation,
Visualization and Integrated Discovery, i.e., DAVID http://david.abcc.ncifcrf.gov/, which accepts batch annotation and conducts gene set enrichment analysis. Set of
CMS involves in a particular cancer network was submitted to DAVID for clustering
of the annotation terms and enriched pathways. With such analysis, enriched pathways
and biological processes related to the cancer network are obtained.

There are several studies on integrating TF, miRNA and target genes expression profile
to construct miRNA-regulated modules for cancer diseases. Zhang et al. 59] applied Sparse Network-regularized Multiple Nonnegative Matrix Factorization (SNMNMF)
algorithm to identify miRNA regulatory modules by combining expression profiles of
both miRNAs and genes, gene-gene interaction (GGI) and DNA-protein interaction. The
study had shown that miRNA-gene modules are enriched in (i) genomics miRNA clusters,
(ii) known functional annotations, and (iii) cancer diseases.

Le et al. 60] developed the regression-based model called PIMiM (Protein Interaction-based MicroRNA
Modules) to predict miRNA-regulated modules by integrating expression profiles of
both miRNAs and genes, sequence-based predictions of miRNA-mRNA interactions and protein-protein
interactions data. Using ovarian cancer as a case study, PIMiM had demonstrated that
it is able to identify cancer-specific miRNAs, presence of expression coherence between
miRNA and mRNA, and enriched functional description.

Li et al. 61] proposed Mirsynergy which applied a two-stage clustering approach to integrate m/miRNA
expression profile, target site information and gene-gene interaction (GGI) to infer
miRNA regulatory modules (MiRMs).

Our results differ in several aspects, (i) TMMN can provide regulatory order among
GGI, (ii) both TF ? miRNA and miRNA ? gene information are obtained from experimentally
verified database, instead of prediction, (iii) we also knew that the target gene
is an OCG or TSG; these information are definite not available in those studies 59-61].

Signal transduction networks (STNs)

Twenty-four STNs are retrieved from KEGG, where only 13 STNs are found to compose
of the proposed motif types. To quantify the number of common motif nodes share between
cancer networks and STNs, we characterized that using the Jaccard index, JI, which is given by:

(4)

where |A ? B|, |A| and |B| denote the cardinality of A ? B , |A| and |B| respectively. A and B denote the sets of motif nodes found in a cancer network and a STN respectively.