A method for predicting protein complex in dynamic PPI networks

Datasets and evaluation metrics

In our experiments, we use four high-throughput yeast PPI datasets: the Gavin dataset [23], the Krogan dataset [24], the MIPS dataset [25] and the STRING dataset [26]. STRING is currently one of the largest PPI datasets; it integrates yeast PPI data from four sources: high-throughput data, co-expression data, genomic context data and biomedical literature data. The statistics of the four yeast PPI datasets are listed in Table 1.

Table 1

The statistics of PPI datasets in experiments

The gene expression data used in our experiments is GSE3431 [27], downloaded from the Gene Expression Omnibus (GEO). GSE3431 is an expression profiling of yeast by Affymetrix array, which includes the expression profiles of 9,335 probes. The experimental design of GSE3431 covers 12 time intervals per cycle, with approximately 25 min per time interval. Therefore, there are 12 active time points (T1, T2, …, T12) for each gene in a cycle. We construct four dynamic PPI networks by integrating the high-throughput PPI data with the gene expression data: DPN_Gavin, DPN_Krogan, DPN_MIPS and DPN_STRING are constructed by integrating GSE3431 with the Gavin, Krogan, MIPS and STRING datasets, respectively.

The benchmark protein complex dataset CYC2008 [28], which includes 408 manually curated heteromeric protein complexes, is used to evaluate the protein complexes predicted by our method.

The neighborhood affinity score NA(P,B) between an identified complex $P(V_P, E_P)$ and a known complex $B(V_B, E_B)$ in the benchmark set is defined as

$$NA(P,B) = \frac{|V_P \cap V_B|^2}{|V_P| \times |V_B|}$$

If NA(P,B) is 1, the identified complex $P(V_P, E_P)$ has the same proteins as the known complex $B(V_B, E_B)$. Conversely, if NA(P,B) is 0, $P(V_P, E_P)$ and $B(V_B, E_B)$ share no protein. We consider $P(V_P, E_P)$ and $B(V_B, E_B)$ to match each other if NA(P,B) is larger than 0.2, the same threshold used by most methods for protein complex identification [6].
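The matching rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes NA is the usual overlap score $|V_P \cap V_B|^2 / (|V_P| \times |V_B|)$, represents each complex as a set of protein identifiers, and uses hypothetical function names and placeholder protein IDs.

```python
def neighborhood_affinity(predicted, known):
    """NA score between two complexes given as collections of protein IDs.

    Assumes the common definition |P ∩ B|^2 / (|P| * |B|).
    """
    p, b = set(predicted), set(known)
    if not p or not b:
        return 0.0  # an empty complex cannot overlap with anything
    shared = len(p & b)
    return shared ** 2 / (len(p) * len(b))


def is_match(predicted, known, threshold=0.2):
    """Two complexes match if NA is strictly larger than the threshold."""
    return neighborhood_affinity(predicted, known) > threshold


# Placeholder protein IDs, for illustration only.
identical = neighborhood_affinity({"A", "B", "C"}, {"A", "B", "C"})
disjoint = neighborhood_affinity({"A", "B"}, {"C", "D"})
partial = neighborhood_affinity({"A", "B", "C"}, {"B", "C", "D", "E"})
```

Here `identical` and `disjoint` reproduce the two boundary cases described above, while `partial` shares two proteins out of complexes of size three and four.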

Precision, recall and F-score are defined as

$$Precision = \frac{N_{ci}}{|\text{Identified\_Set}|}, \quad Recall = \frac{N_{cb}}{|\text{Benchmark\_Set}|}, \quad F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where $N_{ci}$ is the number of identified complexes that match at least one known complex, and $N_{cb}$ is the number of known complexes that match at least one identified complex. Identified_Set denotes the set of complexes identified by a method and Benchmark_Set denotes the gold standard dataset. Precision measures the fidelity of the predicted protein complex set. Recall quantifies the extent to which a predicted complex set captures the known complexes in the benchmark set. F-score provides a reasonable combination of both precision and recall, and can be used to evaluate the overall performance. To keep our evaluation consistent with most studies, we choose F-score as the major evaluation metric.
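As a hedged sketch of how these three measures could be computed, the Python below counts $N_{ci}$ and $N_{cb}$ with the NA > 0.2 matching rule and assumes the conventional forms Precision = $N_{ci}$ / |Identified_Set|, Recall = $N_{cb}$ / |Benchmark_Set| and F-score = 2 × Precision × Recall / (Precision + Recall). Function names and the toy complexes are hypothetical.

```python
def na(p, b):
    """Neighborhood affinity, assumed to be |P ∩ B|^2 / (|P| * |B|)."""
    p, b = set(p), set(b)
    return len(p & b) ** 2 / (len(p) * len(b)) if p and b else 0.0


def precision_recall_fscore(identified, benchmark, threshold=0.2):
    """Return (precision, recall, F-score) for two lists of complexes."""
    # N_ci: identified complexes matching at least one known complex
    n_ci = sum(1 for p in identified
               if any(na(p, b) > threshold for b in benchmark))
    # N_cb: known complexes matching at least one identified complex
    n_cb = sum(1 for b in benchmark
               if any(na(p, b) > threshold for p in identified))
    precision = n_ci / len(identified) if identified else 0.0
    recall = n_cb / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data with placeholder protein IDs.
identified_set = [{"A", "B", "C"}, {"X", "Y"}]
benchmark_set = [{"A", "B", "C", "D"}, {"M", "N"}, {"P", "Q"}]
precision, recall, f_score = precision_recall_fscore(identified_set, benchmark_set)
```

In the toy data only the first identified complex matches a known complex, so one of two predictions and one of three benchmark complexes count as matched.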

Recently, sensitivity (Sn), positive predictive value (PPV) and accuracy (Acc) have also been used to evaluate protein complex prediction tools. Acc, the geometric mean of Sn and PPV, represents a tradeoff between the two. The advantage of the geometric mean is that it yields a low score when either Sn or PPV is low, so a high accuracy requires high performance on both criteria. These definitions are described in detail by Li et al. [6]. In our experiments, we also report the Sn, PPV and Acc of our method on the different PPI datasets.
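The text defers the exact definitions to Li et al. [6]. The sketch below assumes the commonly used contingency-table formulation, which may differ in detail from the one the authors applied: $T_{ij}$ is the number of proteins shared by benchmark complex $i$ and identified complex $j$, Sn is the sum of row maxima divided by the total size of the benchmark complexes, PPV is the sum of column maxima divided by the sum of all $T_{ij}$, and Acc is their geometric mean. All names and the toy complexes are hypothetical.

```python
from math import sqrt


def sn_ppv_acc(identified, benchmark):
    """Return (Sn, PPV, Acc) under the assumed contingency-table definitions."""
    identified = [set(c) for c in identified]
    benchmark = [set(c) for c in benchmark]
    # t[i][j]: proteins shared by benchmark complex i and identified complex j
    t = [[len(b & p) for p in identified] for b in benchmark]

    # Sn: best coverage of each benchmark complex, normalized by complex sizes
    sn_den = sum(len(b) for b in benchmark)
    sn = sum(max(row) for row in t) / sn_den if identified and sn_den else 0.0

    # PPV: best benchmark overlap of each identified complex, normalized by
    # the total overlap of all identified complexes with the benchmark
    columns = list(zip(*t))
    ppv_den = sum(sum(col) for col in columns)
    ppv = sum(max(col) for col in columns) / ppv_den if ppv_den else 0.0

    # Acc: geometric mean, low whenever either Sn or PPV is low
    return sn, ppv, sqrt(sn * ppv)


# Toy data with placeholder protein IDs.
benchmark_set = [{"A", "B", "C", "D"}, {"E", "F"}]
identified_set = [{"A", "B", "C"}, {"D", "E", "F"}]
sn, ppv, acc = sn_ppv_acc(identified_set, benchmark_set)
```

Because Acc multiplies the two scores before taking the square root, a method that inflates Sn by predicting a few very large complexes is penalized through its low PPV.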