Identification of protein complexes from multi-relationship protein interaction networks

With the completion of the sequencing of the human genome, proteomic research becomes one of the most important areas in the life science. One important task in proteomics is to detect protein complexes based on protein-protein interaction (PPI) data generated by various experimental technologies, e.g., yeast-two-hybrid [1], tandem affinity purification [2], and mass spectrometry [3]. Protein complexes are molecular aggregations of proteins assembled by PPIs, which play critical roles in biological processes. Many proteins are functional only when they are assembled into a protein complex and interact with other proteins in this complex. Protein complexes are key molecular entities to perform cellular functions. Even in the relatively simple model organism Saccharomyces cerevisiae, these complexes are comprised of many subunits that work in a coherent fashion. Besides applications of PPI networks, such as protein function predictions [4] and essential protein discoveries [5–11], prediction of protein complexes is another active topic. Actually, protein complexes are of great importance for understanding the principles of cellular organization and function.

Many computational methods for predicting protein complexes from PPI networks have been developed. Pair-wise protein interactions can be modelled as a graph or network, where vertices are proteins and edges are PPIs. Since proteins in the same complex are highly interactive with each other, protein complexes generally correspond to dense subgraphs in the PPI network and many previous studies have been proposed based on this observation, such as MCODE (Molecular Complex detection) [12], MCL (Markov Cluster algorithm) [13], R-MCL (Regularized MCL) [14], CMC (Maximal Clique algorithm) [15], RRW (Repeated Random Walks) [16], SPICi (Speed and Performance in Clustering algorithm) [17], HC-PIN (Hierarchical Clustering based on Protein-Protein Interaction Network) [18], IPC-MCE (Identifying Protein Complexes based on Maximal Clique Extension) [19], and IPCA (Identification of Protein Complexes Algorithm) [20]. Nepusz et al. [21] proposed an algorithm to find overlapping protein complexes from PPI networks, named ClusterONE (Clustering with Overlapping Neighborhood Expansion). For the convenience of researchers, MCODE, ClusterONE, etc. have been designed as plus-in for protein complex prediction and biological network analysis. ClusterViz [22] is such a Cytoscape APP to complete this work.

However, these abovementioned approaches for extracting dense subgraphs fail to take into account the inherent organization. Recent analysis of experimentally detected protein complexes [23] has revealed that a complex consists of a core component and attachments. Core proteins are highly co-expressed and share high functional similarity, and each attachment protein binds to a subset of core proteins to form a biological complex. Based on the core-attachment concept, some algorithms have been proposed, including COACH (Core-Attachment-based method) [24], CORE [25], MCL-Caw [26], DCU (Detecting Complex based on Uncertain graph model) [27], and WPNCA (a Weighted PageRank-Nibble algorithm with Core-Attachment structure) [28].

In spite of the advances in computational approaches and related fields, accurate identification protein complexes are still a bottleneck. One of the most important reasons is that the PPI network contains a lot of false positives which greatly reduce the complex detection accuracy. To address this problem, biological information other than PPIs has been integrated with network topology to improve the precision of protein complex detection methods. Wu et al. proposed a method called CACHET to discover protein complexes with core-attachment structures from tandem affinity purification (TAP) data [29]. Tang et al. [30] constructed time course PPI networks by incorporating gene expression into PPI networks and applied it successfully to the identification of function modules. Wang et al. [31] proposed a three-sigma method to identify active time points of each protein in a cellular cycle, where three-sigma principle is used to compute an active threshold for each gene according to the characteristics of its expression curve. A dynamic PPI network (DPIN) is constructed for the detection of protein complexes. Li et al. proposed novel algorithms, such as TSN-PCD [32] and DPC [33], to identify dynamic protein complexes by integrating PPI data and dynamic gene expression profiles. Zhao et al. [34] reconstructed a weighted PPI network by using dynamic gene expression data and developed a novel protein complex identification algorithm, named PCIA-GeCo.

There exist complex and diverse relationships among proteins after integrating multiple sources of biological information. However, comparing PPI data is difficult because they are often diverse and play different roles under different conditions. Current existing approaches failed to take into account and combined the interactions with different natures into one interaction effectively. Taking into account the influences of different types of interactions are not the same weight for protein complex prediction, we construct a multi-relationship protein interaction network (MPIN) by integrating PPI network topology with gene ontology (GO) annotation information. Then, a new method named MINE (identify protein complexes based on Multi-relationship protein Interaction NEtwork) is proposed. We have conducted an experiment on yeast data. Experimental results show that MINE outperforms the existing methods in terms of both accuracy and p value.