Semi-supervised multi-label collective classification ensemble for functional genomics


In this section, we present the GM-SMCC algorithm to address the task of predicting
functional properties of proteins. Our approach is to model the problem with a generative
process and learn a probabilistic interpretation of the data for estimating the conditional
distribution P(c|x), where c is a functional class and x is a protein instance.

GM-SMCC

Given the dataset X = {x_1, …, x_i, …, x_N} with the attribute features W = {w_1, …, w_j, …, w_M}, we set up a generative model for the attribute features of the protein instances
in X (including both labeled and unlabeled data) and estimate the conditional distribution
P(c|x) using the pLSA model originally developed for latent topic analysis. Unlike other
topic models based on latent topics, we adopt the protein functional classes c_k as the latent variables in the pLSA model and fix P(c_k|x_i) for the annotated proteins during the learning process. The model is given as

P(x_i, w_j) = P(x_i) \sum_{k=1}^{K} P(c_k|x_i) P(w_j|c_k)
where P(c_k|x_i) is the probability that a protein instance x_i is associated with functional class c_k, and P(w_j|c_k) is the probability that attribute feature w_j occurs in a protein of class c_k. For efficient optimization, we work with the log-likelihood. The likelihood
function is transformed into:

L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(x_i, w_j) \log \sum_{k=1}^{K} P(c_k|x_i) P(w_j|c_k)    (1)

where n(x_i, w_j) is the frequency of w_j occurring in x_i, and N and M are the numbers of proteins and attribute features, respectively.

We exploit knowledge of the network topological structure of the data to better estimate
the conditional probability P(c|x), based on the assumption that nearby nodes tend to have similar labels. Specifically, if two nodes x_i and x_s are connected in the network, they should tend to share similar class labels,
i.e., their conditional distributions P(c|x_i) and P(c|x_s) should be close to each other. Here, we use the Kullback-Leibler (KL) divergence
to measure the distance between two distributions. Suppose the distribution of P(c_k|x_i) over the different classes is represented as a vector z_i = [P(c_1|x_i), …, P(c_K|x_i)]^T. Then the KL-divergence between z_i and z_s is defined as

KL(z_i \| z_s) = \sum_{k=1}^{K} P(c_k|x_i) \log \frac{P(c_k|x_i)}{P(c_k|x_s)}

The KL-divergence is not symmetric, and thus we use the symmetric KL-divergence

D(z_i; z_s) = KL(z_i \| z_s) + KL(z_s \| z_i)

to measure the distance between the two distributions. Here, D(z_i; z_s) is always nonnegative.

As discussed above, our idea is to smooth the distribution P(c|x) over the network: if two proteins interact, their conditional
distributions P(c|x_i) and P(c|x_s) should be close to each other. Such local smoothness with respect to the network topology
is explicitly incorporated into the generative model through a network regularizer

R_1 = \sum_{i=1}^{N} \sum_{s=1}^{N} E_{i,s} \, D(z_i; z_s)    (2)

where E is the adjacency matrix representing the network topology, with E_{i,s} = 1 if v_i and v_s are connected and E_{i,s} = 0 otherwise.

In protein functional property prediction, proteins are generally involved in multiple biological
processes and have multiple functions. It is therefore crucial to take label correlations
into account to better predict their functional classes. Here, we further generalize
the generative model to support this setting. Recall that the network regularizer
smooths the label probability distribution over the intrinsic network structure.
One also hopes that the resulting distribution is smooth with respect to the
class label correlations. A natural assumption is that if two class labels
c_k and c_l are related, then the distributions P(c_k|x_i) and P(c_l|x_i) over the different instances should also be similar to each other.

In particular, we construct a label-to-label affinity graph with K vertices, where each vertex corresponds to one class label. For each pair of vertices,
we put an edge between them and compute its weight. There are many choices for defining
the weight matrix F = [F_{kl}] on the affinity graph; we use the commonly used dot product

F_{kl} = Y_k^{T} Y_l

where Y_k = [Y_{1,k}, …, Y_{N,k}]^T is the label distribution over the instances, such that Y_{i,k} is nonzero if x_i belongs to class c_k and the remaining elements are zero. Here, Y_k is normalized to unit length, so the dot product of two vectors is equivalent to their cosine similarity.
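
As a small illustration (a hypothetical NumPy sketch, not the authors' code), the affinity matrix F can be computed from a binary label indicator matrix whose columns are normalized to unit length:

```python
import numpy as np

def label_affinity(Y):
    """Y: N-by-K binary label indicator matrix over the annotated proteins (assumed layout).
    Each column Y_k is normalized to unit length, so the dot product Y_k . Y_l
    equals the cosine similarity F[k, l]."""
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)
    return Yn.T @ Yn   # F[k, l] = Y_k^T Y_l
```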

Suppose the vector representation of P(c_k|x_i) over the different instances is r_k = [P(c_k|x_1), …, P(c_k|x_N)]^T. We define the KL-divergence between r_k and r_l for a pair of class labels as

KL(r_k \| r_l) = \sum_{i=1}^{N} P(c_k|x_i) \log \frac{P(c_k|x_i)}{P(c_l|x_i)}

together with the corresponding symmetric divergence D(r_k; r_l) = KL(r_k \| r_l) + KL(r_l \| r_k). Using the label affinity matrix F and this symmetric KL-divergence, we define the label regularizer

R_2 = \sum_{k=1}^{K} \sum_{l=1}^{K} F_{kl} \, D(r_k; r_l)    (3)

to smooth the distribution P(c|x).

Incorporating the smoothness terms (2) and (3) into the objective function in (1),
we obtain the new objective function

O = L - \lambda R_1 - \eta R_2    (4)

where λ and η are the regularization parameters. When λ = 0 and η = 0, maximizing O is equivalent to learning with the original pLSA model.

For the annotated proteins, the probability distributions P(c|x) are fixed during the learning process. Specifically, the probability assignments are
defined as a uniform distribution over the known functional class labels:

P(c_k|x_i) = 1 / l_{x_i} if x_i is annotated with class c_k, and P(c_k|x_i) = 0 otherwise,    (5)

where l_{x_i} is the number of functional classes of the annotated protein x_i.

For the unannotated proteins, we maximize the log-likelihood function to compute their probability distributions. The resulting distribution
P(c|x_i) for a given instance x_i indicates the importance of each function to the protein. One hopes that the
values P(c_l|x_i) of the relevant labels are close to each other and larger
than those of the irrelevant labels. Hence, to make a prediction for x_i, we first rank the labels according to P(c_k|x_i). We then separate the labels into relevant and irrelevant subsets according
to the largest change observed across the sorted P(c_k|x_i); that is, we seek the largest difference between two successive values P(c_k|x_i) and P(c_{k+1}|x_i) in the sorted order. Their median value, t = (P(c_k|x_i) + P(c_{k+1}|x_i))/2, is used as the splitting threshold: the relevant set consists of the labels with probabilities
larger than t, and the irrelevant set contains the remaining labels.
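
To make the ranking-and-splitting step concrete, the following sketch (a hypothetical Python/NumPy illustration, not the authors' implementation) partitions the labels of one instance by the largest gap between successive sorted probabilities:

```python
import numpy as np

def split_relevant_labels(p):
    """Split class labels into relevant/irrelevant sets by the largest gap
    between successive sorted probabilities P(c_k | x_i).

    p : 1-D array of label probabilities for one protein instance.
    Returns (relevant_idx, irrelevant_idx, threshold)."""
    order = np.argsort(p)[::-1]          # labels ranked by decreasing probability
    sorted_p = p[order]
    gaps = sorted_p[:-1] - sorted_p[1:]  # differences between successive ranked values
    cut = int(np.argmax(gaps))           # position of the largest change
    t = (sorted_p[cut] + sorted_p[cut + 1]) / 2.0   # median of the two values
    return order[:cut + 1], order[cut + 1:], t

# Example: probabilities over 5 functional classes
rel, irr, t = split_relevant_labels(np.array([0.40, 0.35, 0.10, 0.10, 0.05]))
# largest gap is between 0.35 and 0.10, so labels 0-1 are relevant (t = 0.225)
```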

Model fitting with the EM algorithm

Our proposed approach, GM-SMCC, utilizes the generative model with both network and
label regularization for protein function prediction, and its parameter estimation
differs from the original pLSA [31] and from previous work combining pLSA with manifold learning for unsupervised data clustering
[32]. Next, we introduce the EM algorithm used in the proposed GM-SMCC approach for finding
maximum likelihood parameter estimates.

In the proposed generative model, we have NK + MK parameters {P(w_j|c_k), P(c_k|x_i)}, where the class labels c_k are treated as the latent variables. For convenience, we denote these parameters
by Θ. We use the EM algorithm, which alternates between an expectation step (E-step)
and a maximization step (M-step), to estimate the parameters of the proposed GM-SMCC
model.

E-step

The E-step is the same as in the pLSA model. The posterior probability of the latent
variable, P(c_k|x_i, w_j), is computed as

P(c_k|x_i, w_j) = \frac{P(c_k|x_i) P(w_j|c_k)}{\sum_{k'=1}^{K} P(c_{k'}|x_i) P(w_j|c_{k'})}    (6)

M-step

The M-step re-estimation of {P(w_j|c_k)} is the same as in the pLSA model:

P(w_j|c_k) = \frac{\sum_{i=1}^{N} n(x_i, w_j) P(c_k|x_i, w_j)}{\sum_{j'=1}^{M} \sum_{i=1}^{N} n(x_i, w_{j'}) P(c_k|x_i, w_{j'})}    (7)

In the M-step, the parameters are updated based on the expected complete-data log-likelihood,
which depends on the posterior probabilities computed in the E-step [31]. Using these posterior probabilities, the expected complete-data log-likelihood of (4) is given by

Q = \sum_{i=1}^{N} \sum_{j=1}^{M} n(x_i, w_j) \sum_{k=1}^{K} P(c_k|x_i, w_j) \log \big[ P(w_j|c_k) P(c_k|x_i) \big] - \lambda R_1 - \eta R_2

We need to maximize Q with respect to the parameters Θ subject to the constraints \sum_{j} P(w_j|c_k) = 1 and \sum_{k} P(c_k|x_i) = 1. Therefore, we augment Q with the appropriate Lagrange multipliers α_i to obtain

H = Q + \sum_{i=1}^{N} \alpha_i \Big( 1 - \sum_{k=1}^{K} P(c_k|x_i) \Big)    (8)

Maximizing H with respect to P(c_k|x_i) leads to the following set of equations:

(9)

where 1 ≤ i ≤ N and 1 ≤ k ≤ K.

We expect that if two proteins x_i and x_s are closely linked in the network (i.e., E_{i,s} is large), then the distributions P(c_k|x_i) and P(c_k|x_s) are similar to each other, i.e., P(c_k|x_i) will be close to P(c_k|x_s). We have

Similarly, if two functions c_k and c_l are closely related (i.e., F_{kl} is large), then the distributions P(c_k|x_i) and P(c_l|x_i) are similar to each other, i.e., P(c_k|x_i) will be close to P(c_l|x_i).

We have,

By using the approximation

(9) can be written as

(10)

where 1 ≤ i ≤ N, 1 ≤ k ≤ K,

and

To obtain the M-step re-estimation for P(c|x), we construct the following matrices and vectors: Z, Λ, D, B, U, and R.

First, we construct a K-by-K block diagonal matrix D = [D_{i,j}] based on the adjacency matrix E, where the (i, j)th block of D is an N-by-N matrix D_{i,j} = [d_{i,j,s,t}]_{s,t=1,…,N}. All the entries of D are equal to 0 except the diagonal entries

Next, we construct another K-by-K block diagonal matrix B = [B_{i,j}], where its (i, j)th block is also an N-by-N matrix B_{i,j} = [b_{i,j,s,t}]_{s,t=1,…,N}. The entries of B are equal to 0 when i ≠ j; otherwise, if i = j, we have b_{i,j,s,t} = E_{st}.

Then, we construct an N-by-N block diagonal matrix U = [U_{i,j}] based on the label correlation matrix F, where the (i, i)th block of U is a K-by-K matrix U_{i,i} = [u_{i,i,s,t}]_{s,t=1,…,K}. All non-diagonal entries of U are equal to 0, and the diagonal entries

The matrix R = [R_{i,j}] is another K-by-K block matrix, where its (i, j)th block is an N-by-N matrix R_{i,j} = [r_{i,j,s,t}]_{s,t=1,…,N}. Indeed, each R_{i,j}, for i, j = 1, …, K, is a diagonal matrix with r_{i,j,s,s} = F_{ij}.

The matrix Z is a K-by-1 block vector, where its k-th entry Z_k is an N-dimensional vector defined as follows

The matrix Λ is a K-by-K block matrix, where its (i, j)th block is an N-by-N diagonal matrix. All the non-diagonal entries of Λ are equal to 0, and the diagonal entries

Let y denote a K-by-1 block vector, where

The system of equations in (9), approximated using (10), can be solved in the following
matrix form:

(11)

Thus, the M-step re-estimation for P (c|x) is

(12)

The E-step (6) and the M-steps (7) and (12) are alternated until the objective function
(4) converges.

In the initialization step of the EM algorithm, the values of P(w_j|c_k) and P(c_k|x_i) are initialized based on the class priors computed from the annotated proteins. We
assume that the features w_j are conditionally independent of each other given the label c_k. Concretely, P(w_j|c_k) is initialized as P(w_j|c_k) = n(w_j, c_k) / \sum_{j'} n(w_{j'}, c_k), where n(w_j, c_k) is the frequency with which w_j and c_k co-occur. The label distribution P(c_k|x_i) for unannotated proteins is initialized to the class prior P(c_k|x_i) = \sum_{s} \delta(c_k, x_s) / \sum_{k'} \sum_{s} \delta(c_{k'}, x_s) over the annotated proteins, where \delta(c_k, x_s) = 1 if x_s is associated with c_k and 0 otherwise. In each iteration of the EM algorithm, the probability assignments
of P(c|x) for the labeled data are reset according to the known functional class labels as in
Eq. (5).
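
For illustration, here is a minimal, hypothetical NumPy sketch of the EM core (the E-step of Eq. (6) and the M-step of Eq. (7)) with the labeled distributions held fixed. It deliberately uses the plain pLSA update for P(c|x) instead of the regularized M-step of Eq. (12), so it is a simplification of GM-SMCC, not the full model:

```python
import numpy as np

def em_plsa_with_fixed_labels(n, labeled_P_cx, n_iter=50, eps=1e-12):
    """Simplified EM sketch for the pLSA core of GM-SMCC (regularizers omitted).

    n            : N-by-M count matrix, n[i, j] = frequency of feature w_j in protein x_i
    labeled_P_cx : dict {i: K-vector} with the fixed distributions of Eq. (5)
                   for annotated proteins"""
    N, M = n.shape
    K = len(next(iter(labeled_P_cx.values())))
    rng = np.random.default_rng(0)
    P_wc = rng.random((M, K)); P_wc /= P_wc.sum(axis=0, keepdims=True)  # P(w_j | c_k)
    P_cx = np.full((N, K), 1.0 / K)                                      # P(c_k | x_i)
    for i, dist in labeled_P_cx.items():
        P_cx[i] = dist

    for _ in range(n_iter):
        # E-step, Eq. (6): posterior P(c_k | x_i, w_j), shape N x M x K
        post = P_cx[:, None, :] * P_wc[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + eps
        # M-step, Eq. (7): re-estimate P(w_j | c_k)
        P_wc = np.einsum('im,imk->mk', n, post)
        P_wc /= P_wc.sum(axis=0, keepdims=True) + eps
        # M-step for P(c_k | x_i): plain pLSA update here; the full GM-SMCC model
        # instead solves the regularized system of Eq. (12).
        P_cx = np.einsum('im,imk->ik', n, post)
        P_cx /= P_cx.sum(axis=1, keepdims=True) + eps
        # Reset the distributions of annotated proteins, Eq. (5)
        for i, dist in labeled_P_cx.items():
            P_cx[i] = dist
    return P_wc, P_cx
```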

EGM-SMCC algorithm

The power of the network regularizer in Eq. (4) of our proposed GM-SMCC model lies
in the fact that the linkages of the network generally exhibit predictable relationships
between the class labels of linked proteins. If we have an unannotated protein and
a good understanding of the relationship between its functions and the functional
properties of its labeled neighbors, then we should be able to make a good prediction
of the protein's functional properties based on the linkage information.

In the proposed GM-SMCC model, we rely on the autocorrelation in the protein interaction
network, which may contain inconsistent linkages between proteins that do not share
similar functional properties. In functional genomics studies, if more information
is available, one can derive more effective networks that capture useful relationships
between the proteins and propagate the supervision knowledge from labeled nodes to
unlabeled nodes.

In the real world, protein data are associated with various data sources. For example,
proteins are associated with attribute features; proteins with similar feature values
may also be similar in their associated functions. Proteins are also associated with
a set of functional labels, which can be represented by label features that are
useful for evaluating the pairwise similarity of protein instances. These latent linkages
are already embedded in the data, and we can exploit this knowledge to construct latent
graphs for more effective label prediction.

In this paper, in addition to the PPI network, we introduce two types of latent linkages
to construct latent graphs. Based on the latent graphs we constructed, we extend our
proposed generative model in an ensemble manner to further boost the prediction performance.

Given the adjacency matrices of q latent graphs, the proposed ensemble algorithm, EGM-SMCC, is described in Algorithm
1. In the EGM-SMCC algorithm, we learn an individual GM-SMCC model on each of the
constructed latent graphs, and then combine the learned models to obtain a more reliable
prediction than that of a model built on a single latent graph.

Algorithm 1 EGM-SMCC

Input: the latent graphs E(1), …, E(q), X, Y, and the parameters λ and η

Output: y

Procedure:

1: for i = 1 to q do

2:    Learn a GM-SMCC model using the constructed latent graph E(i). In the GM-SMCC model, compute the network regularizer in Eq. (2) according to E(i);

3:    Use EM algorithm to optimize the GM-SMCC model to compute the label probability
distribution y(i);

4: end for

5: Combine the results of the q learned models y(1), y(2), …, y(q) into an ensemble prediction y
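
As a sketch of the ensemble step (hypothetical Python; the paper's exact combination rule is not reproduced here, so a simple average of the q label probability matrices is used as one plausible choice):

```python
import numpy as np

def egm_smcc_ensemble(latent_graphs, fit_gm_smcc):
    """latent_graphs: list of adjacency matrices E(1), ..., E(q).
    fit_gm_smcc(E): assumed callback that runs the EM optimization of GM-SMCC
    with latent graph E and returns an N-by-K label probability matrix y(i)."""
    preds = [fit_gm_smcc(E) for E in latent_graphs]  # steps 1-4: one model per latent graph
    return np.mean(preds, axis=0)                    # step 5: combine into the ensemble prediction
```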

The basic idea of constructing latent graphs is to link the protein nodes together
such that nodes that are closer in the graphs tend to have the same functional
labels, while nodes that are disconnected tend to have different functional
labels. Via the latent linkages in the constructed latent graphs, knowledge from
labeled nodes can be propagated to unlabeled nodes more effectively, as in the example
in Figure 1. Next, we introduce three types of latent linkages to construct latent graphs that
can be easily computed from the data. For each individual latent graph, we compute
a weight E_{i,j} for each entry of its adjacency matrix, where a large E_{i,j} indicates that two nodes are close together, and vice versa.

Figure 1. An example of latent graphs used in the proposed EGM-SMCC model. (a) PPI latent graph, i.e., the original interaction network, where the ground-truth
label of the center node v1 is "+", but it is difficult to predict the label (+ or -) for node v1 since it has the same number of positive and negative neighboring nodes; (b) even-step
random walk latent graph: the direct neighbors of v1 with the same label ("+") have triangle edges (red lines), hence they are reachable from v1 using even-step random walks, while the indirect neighbors (from the
+ nodes) in the network are linked by creating edges (dashed lines) using even-step random
walks; (c) prediction similarity latent graph using kNN graph construction. In the kNN graph, a node pair shares an undirected edge if the two nodes are k-nearest neighbors. In this example, we set k = 3.

PPI latent graph: In our ensemble model, we consider the PPI network itself as a latent graph and construct
the adjacency matrix E(1) of the PPI latent graph as E(1)(i, j) = 1 if node v_i and node v_j are connected in the PPI network, and E(1)(i, j) = 0 otherwise.

Random walk latent graph: When the underlying autocorrelation of the original PPI network is weak, i.e., some connected
nodes may not share the same class label, a learning method based on the original
PPI network alone may be adversely affected.

It has been observed that proteins interacting with level-2 neighbors (indirect neighbors
in the PPI network) also have a high likelihood of sharing similar characteristics
[8]. To this end, we use the idea of even-step random walk with restart (ERWR) [33] to compute the weights of the latent linkages. Intuitively, we assume that linkages
to direct neighbors sharing the same functional class as the target protein of interest
typically have triangle structures (see Figure 1(b)). These neighbors (v2 and v3) obtain high scores under ERWR because they are well connected in the
PPI network. On the other hand, ERWR can avoid the immediate neighbors (e.g., v1 and v2) with inconsistent linkages that negatively influence the predictions, because they
are sparsely connected. ERWR can also exploit the indirect neighbor information by adding
linkages to level-2 neighbors (e.g., v4) that are well connected to level-1 neighbors.

Given the adjacency matrix E of the PPI network, we compute P = EE and normalize its entries with respect to each column to obtain a normalized transition
probability matrix P. The ERWR random walker iteratively visits neighboring nodes with the transition probabilities
given in P. At each step, it also has probability γ (e.g., γ = 0.1) of returning to the start node. We define the adjacency matrix E(2) of the random walk latent graph as the steady-state probability matrix obtained after T steps of this walk.
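
A minimal sketch of this construction (hypothetical Python; the restart recursion shown is the standard random-walk-with-restart update and is assumed here rather than taken from the paper):

```python
import numpy as np

def even_step_rwr(E, gamma=0.1, T=20):
    """Even-step random walk with restart on the PPI adjacency matrix E.
    Returns the matrix of walk probabilities after T steps, used as E(2)."""
    P = E @ E                                         # even-step (two-hop) connectivity
    P = P / (P.sum(axis=0, keepdims=True) + 1e-12)    # column-normalize into transition probabilities
    N = E.shape[0]
    R = np.eye(N)                                     # one walker started at each node
    for _ in range(T):
        R = (1.0 - gamma) * (P @ R) + gamma * np.eye(N)  # walk step with restart probability gamma
    return R
```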

Prediction similarity latent graph: We also consider the class labels of the annotated proteins as input for building a classifier
that predicts all unlabeled proteins. Specifically, we use an SVM classifier with probability outputs, as implemented in the LIBSVM library [34], to compute the label confidences P(c_j|x_i), where P(c_j|x_i) is the confidence that protein x_i belongs to class c_j. The adjacency matrix E(3) of the latent graph based on these prediction confidences is defined as

E(3)(i, j) = Y_i^{T} Y_j

where Y_i and Y_j are the prediction confidence vectors of x_i and x_j normalized to unit length; thus the dot product of the two vectors is equivalent
to their cosine similarity.

In the prediction similarity latent graph, many entries are close to zero, and it may not be necessary to consider them. Therefore, we use a kNN construction scheme for the graph: we connect two nodes v_i and v_j if v_j is among the k-nearest neighbors of v_i or if v_i is among the k-nearest neighbors of v_j [35]. The number of edges is then O(N) and the graph is symmetric. We define a sparse adjacency matrix for the kNN graph by keeping E(3)(i, j) only when v_j is in the set of k nearest neighbors of v_i (or vice versa) and setting it to 0 otherwise. In practice, we find that k does not need tuning; we use k = 10 nearest neighbors for each data set.
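
For illustration, a hypothetical NumPy sketch of the prediction similarity graph and its kNN sparsification (the SVM probability outputs Yhat are assumed to be produced beforehand by a classifier with probability estimates):

```python
import numpy as np

def knn_sparsify(S, k=10):
    """Keep S[i, j] only if v_j is among the k nearest neighbors of v_i or vice versa;
    all other entries are set to zero (symmetric kNN construction)."""
    N = S.shape[0]
    keep = np.zeros_like(S, dtype=bool)
    for i in range(N):
        nn = np.argsort(S[i])[::-1][1:k + 1]   # k most similar nodes to v_i, excluding itself
        keep[i, nn] = True
    keep = keep | keep.T                       # OR-symmetrization
    return np.where(keep, S, 0.0)

def prediction_similarity_graph(Yhat, k=10):
    """Yhat: N-by-K matrix of prediction confidences P(c_j | x_i).
    Rows are normalized to unit length so the dot product equals cosine similarity."""
    Yn = Yhat / (np.linalg.norm(Yhat, axis=1, keepdims=True) + 1e-12)
    return knn_sparsify(Yn @ Yn.T, k=k)
```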

Experiments

In this section, we present extensive experimental results comparing the performance
of our proposed methods with several baselines: SVM, wvRN+RL, ICA, semi-ICA, and
ICML, and show that the proposed methods achieve better performance than these baselines.

Yeast dataset and baselines

We conduct experiments to predict properties of the proteins corresponding to given
yeast genes from KDD Cup 2001 [36]. In particular, we formulate two prediction problems based on the properties of
the proteins. Problem (1) is to predict the localization of the proteins encoded by
the genes; it is a binary problem, i.e., a protein is localized (or not localized)
to the corresponding organelle. Problem (2) is to predict the functions of the proteins,
which is a multi-label problem, i.e., a protein can have more than one function. There
are in total 14 functional classes in the dataset.

The dataset for these two problems consists of 1,243 protein instances and 1,806 pairwise
interactions among the proteins. The protein features include attributes referring
to the chromosome on which the gene appears, whether the gene is essential for survival,
observable characteristics of the phenotype, the structural category of the protein,
the existence of characteristic motifs in the amino acid sequence of the protein, and
whether the protein forms larger complexes with other proteins [36,14].

We evaluate performance on problem (1) by classification accuracy, and on problem
(2) by three multi-label learning evaluation metrics, i.e., Coverage, RankingLoss, and MacroF1 [37]. These criteria are defined as follows.

Coverage evaluates how far we need, on average, to go down the ranked list of labels in order
to cover all the true labels of an instance:

Coverage = \frac{1}{N} \sum_{i=1}^{N} \max_{c_k \in Y_i} \mathrm{rank}_s(x_i, c_k) - 1

where rank_s(x_i, c_k) denotes the rank of class label c_k derived from a confidence function s(x_i, c_k), which indicates the confidence for the class label c_k to be a proper label of x_i, and Y_i is the set of true labels of x_i.

Ranking loss evaluates the average fraction of label pairs that are reversely ordered for an instance:

RankingLoss = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i|\,|\bar{Y}_i|} \big| \{ (c_k, c_l) : s(x_i, c_k) \le s(x_i, c_l),\ (c_k, c_l) \in Y_i \times \bar{Y}_i \} \big|

where \bar{Y}_i denotes the complementary set of Y_i.

MacroF1 is the harmonic mean of precision and recall, computed per label and then averaged
across all labels:

MacroF1 = \frac{1}{K} \sum_{k=1}^{K} \frac{2 p_k r_k}{p_k + r_k}

where p_k and r_k are the precision and recall of the k-th label.
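
For reference, a hypothetical NumPy implementation of these three metrics, following the standard multi-label definitions above (not the authors' evaluation code) and assuming every instance has at least one relevant and one irrelevant label:

```python
import numpy as np

def coverage(scores, Y):
    """scores: N-by-K confidences s(x_i, c_k); Y: N-by-K binary true-label matrix."""
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1          # rank 1 = most confident label
    return np.mean([ranks[i, Y[i] == 1].max() for i in range(len(Y))]) - 1

def ranking_loss(scores, Y):
    total = 0.0
    for s, y in zip(scores, Y):
        rel, irr = np.where(y == 1)[0], np.where(y == 0)[0]
        reversed_pairs = sum(s[k] <= s[l] for k in rel for l in irr)
        total += reversed_pairs / (len(rel) * len(irr))
    return total / len(Y)

def macro_f1(pred, Y):
    """pred, Y: N-by-K binary matrices of predicted and true labels."""
    f1 = []
    for k in range(Y.shape[1]):
        tp = np.sum((pred[:, k] == 1) & (Y[:, k] == 1))
        p = tp / max(pred[:, k].sum(), 1)
        r = tp / max(Y[:, k].sum(), 1)
        f1.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return float(np.mean(f1))
```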

To validate the performance of our proposed algorithms, we compare our approach with
five baseline methods:

1. SVM [34]. This baseline is a feature-based method that uses only the attribute features of the
proteins for learning, without using any network information.

2. wvRN+RL [38]. This algorithm is a relational-only method that uses only the PPI network for prediction.
wvRN+RL computes a new label distribution for an unlabeled node by averaging the current
estimated distributions of its linked neighbors. This process is repeated until the
maximum number of iterations is reached.

3. ICA [28]. This is a collective classification algorithm that uses both attribute features
and relational features to train a base classifier for prediction. The relational
features are constructed from the labels of neighbors. ICA uses an iterative process
in which the relational features are recomputed in each iteration until a fixed number
of iterations is reached. Prior work has found logistic regression (LR) to be superior
to other classifiers, such as naive Bayes and kNN, as the base classifier for ICA. Therefore, we use LR as the local classifier for ICA
in the experiments.

4. semi-ICA [39]. This method extends ICA to leverage the unlabeled data using semi-supervised learning.
There are four semi-ICA variants (KNOWN-EM, ALL-EM, KNOWN-ONEPASS, ALL-ONEPASS); we
run all four variants and report the best one as the result of semi-ICA.

5. ICML [13]. This method extends ICA to handle multi-label learning by constructing additional
label correlation features that exploit the dependencies among the labels as extra
input features for the base classifier. The ICML algorithm is also based on an iterative
framework similar to ICA.

It is generally more difficult to determine the classifier parameter values when the
amount of labeled data is small (which is the focus of this study). For
the SVM classifier, we use LIBSVM [34] with a linear kernel as the base classifier and simply set the penalty parameter to the default C = 1.0. The maximum number of iterations for ICA and semi-ICA is
set to 10, and we use logistic regression as their base classifier as in [39,13]; wvRN+RL uses 1000 iterations. The parameters λ and η of our proposed method are set to 3 and 0.1, respectively. The parameter selection issue is discussed
in a later section.

Results on protein localization prediction

We first consider problem (1) of KDD Cup 2001, i.e., the protein localization prediction
problem. We set λ ≠ 0 and η = 0 in our proposed method, and compare GM-SMCC with the learning algorithms SVM,
wvRN+RL, ICA and semi-ICA. The performance is measured in terms of classification
accuracy.

We compare the algorithms by varying the proportion of labeled
data from 3% to 10% with an interval of 1%. For each labeled/unlabeled data
split, we execute an algorithm for 10 runs with randomly selected data splits, and report
the performance (mean and standard deviation) over the 10 runs. Figure
2 shows the experimental results. As we can see from the figure, the overall picture
is clearly in favor of our proposed GM-SMCC: it consistently outperforms the other
algorithms across different percentages of labeled data. On average, the accuracies
over different percentages for GM-SMCC, semi-ICA, ICA, SVM and wvRN+RL are 0.845,
0.801, 0.788, 0.788 and 0.666, respectively. GM-SMCC performs best, followed by semi-ICA.
The third best methods are ICA and SVM, whose performances are comparable. The
relational-only method wvRN+RL performs the worst.

Figure 2. Classification accuracy on problem (1) of KDD Cup 2001 dataset.

We note that the case with a small amount of labeled data is the most interesting for
our algorithm, because prediction becomes unreliable when the supervised knowledge
in the labeled dataset is inadequate; it is then especially desirable to utilize other
data sources to improve the prediction performance. A closer examination of the results
in Figure 2 shows that the smaller the percentage of labeled data involved, the larger the
improvement GM-SMCC achieves. GM-SMCC achieves the largest improvement over the second
best method when only 3% of the data is labeled (GM-SMCC: 0.82 versus semi-ICA: 0.75).
We also conduct a pairwise t-test at the 0.05 significance level to assess the statistical
significance of the differences in performance between GM-SMCC and the other algorithms
using 3% labeled data. The performance of GM-SMCC is significantly better than those
of the other baseline methods. This result illustrates the advantage of our method
when the amount of labeled data is extremely small, and is consistent with our earlier
assertion that our approach can work even with a paucity of annotated proteins by
exploiting various data sources, including interaction networks, attribute features,
and unlabeled data.

In this study, three types of latent graphs are utilized (see the EGM-SMCC section).
It is thus interesting to investigate the performance of GM-SMCC using a single latent
graph versus the performance of EGM-SMCC utilizing multiple latent graphs. We test
GM-SMCC and EGM-SMCC on the KDD Cup 2001 dataset with label ratios from 3% to 10%.
The experimental results are given in Table 1, where GM-SMCC-1, GM-SMCC-2 and GM-SMCC-3 denote the single-graph models using the
PPI latent graph (E(1)), the random walk latent graph (E(2)) and the prediction similarity latent graph (E(3)), respectively, and GM-SMCC-mean denotes the single-graph model using a latent
graph constructed by averaging the weighting values of E(1), E(2) and E(3).

Table 1. Accuracy (mean±standard deviation) of GM-SMCC and EGM-SMCC for different label ratios on problem
(1) of KDD Cup 2001.

We report the average accuracy and standard deviation of the compared methods over
10 runs. The numbers in boldface (in each row of the table) indicate the best result
for each label ratio over the methods. From Table 1, we observe that EGM-SMCC, using multiple latent graphs, achieves better
performance than the GM-SMCC method using a single latent graph. A reasonable explanation
for this finding is that the different latent graphs, derived from different sources,
are complementary for prediction. When complementary models learned from these latent
graphs are combined in an ensemble, correct decisions are amplified by the aggregation
process. The performance of an ensemble learner depends on two factors: the accuracy
of each component learner and the diversity among the components. Examining the results
in Table 1 shows that the GM-SMCC models generated from the different graphs all perform
reasonably well. This indicates that each latent graph provides prediction knowledge
from a specific aspect, and their combination leads to a more robust prediction.

Results on protein function prediction

We also conduct experiments on problem (2) of KDD Cup 2001, i.e., the multi-label
protein function prediction problem. We set λ and η to be non-zero, thereby considering the network information and the label correlations simultaneously.
We compare the proposed algorithms with the baseline classifiers SVM, wvRN+RL, ICA, semi-ICA
and ICML. SVM, wvRN+RL, ICA and semi-ICA are single-label classifiers; for these methods,
we decompose the multi-label problem into a set of K binary classification problems using the one-against-all strategy and train an independent
classifier for each single-label problem. This approach is known as the binary relevance
(BR) method [40]. The predictions of all K binary classification problems are combined to make the final prediction.
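
As a minimal sketch of this binary relevance decomposition (hypothetical Python; scikit-learn's LinearSVC is used here as an illustrative stand-in for the LIBSVM classifier used in the experiments):

```python
import numpy as np
from sklearn.svm import LinearSVC   # illustrative base binary classifier

def binary_relevance_fit_predict(X_train, Y_train, X_test, C=1.0):
    """One-against-all decomposition: one independent binary classifier per label;
    the K per-label predictions are concatenated into an N-by-K label matrix."""
    K = Y_train.shape[1]
    preds = np.zeros((X_test.shape[0], K), dtype=int)
    for k in range(K):
        clf = LinearSVC(C=C).fit(X_train, Y_train[:, k])   # k-th binary problem
        preds[:, k] = clf.predict(X_test)
    return preds
```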

We compare the proposed GM-SMCC approach and the baseline algorithms with varying
percentages of labeled data from 3% to 10%. For each percentage, we execute each
algorithm 10 times by randomly selecting the labeled/unlabeled data split from the
dataset, and report the average results and standard deviations over the 10 runs. The
results are shown in Figure 3. To keep consistency with the Coverage and RankingLoss evaluation metrics, we report 1-MacroF1 instead of MacroF1, so the smaller the value of each metric, the better the performance of the algorithm.
We see from Figure 3 that GM-SMCC (the black line) has the best performance (lies under the other curves)
across all evaluation metrics and label ratios. Semi-ICA is the second best method.
In the comparison, SVM performs poorly in terms of Coverage, while wvRN+RL, ICML and ICA perform poorly in terms of MacroF1. Recent studies [41] have shown that one multi-label learning algorithm rarely outperforms another
on all criteria, because the evaluation measures assess the learning performance from
different aspects. In our experiments, however, GM-SMCC consistently outperforms the
other algorithms across all label ratios. On average, GM-SMCC
achieves a Coverage improvement of 0.35 (GM-SMCC: 3.90 versus semi-ICA: 4.25), a RankingLoss improvement of 0.01 (GM-SMCC: 0.104 versus semi-ICA: 0.114), and a 1-MacroF1 improvement
of 0.068 (GM-SMCC: 0.640 versus semi-ICA: 0.708) over the second best method. This
result indicates that the proposed GM-SMCC algorithm is effective for the multi-label
protein function prediction task.

Figure 3. The performance of the algorithms with varying percentages of labeled data on problem
(2) of KDD Cup 2001. (a) Coverage; (b) RankingLoss; (c) 1-MacroF1.

Similar to the experiments for protein localization prediction, we also conduct experiments
to examine the effect of the proposed EGM-SMCC method (integrating multiple latent
graphs) in enhancing the prediction performance over the GM-SMCC method using
a single latent graph. GM-SMCC-1, GM-SMCC-2 and GM-SMCC-3 denote the single-graph
models using E(1), E(2) and E(3), respectively, and GM-SMCC-mean denotes the single-graph model using a latent graph
constructed by averaging the weighting values of E(1), E(2) and E(3).

We compare GM-SMCC and EGM-SMCC for different percentages of labeled data
from 3% to 10%. For brevity, we report only Coverage and RankingLoss. The results are given in Figures 4 and 5, where the percentage of labeled data is shown on the horizontal axis. From
the figures, we can see that EGM-SMCC consistently outperforms the GM-SMCC variants
using a single latent graph because more information is utilized. This result demonstrates
the effectiveness of our proposed EGM-SMCC method for multi-label protein function
prediction.

Figure 4. The Coverage of EGM-SMCC and GM-SMCC with various latent graphs: GM-SMCC-1 (PPI latent graph),
GM-SMCC-2 (random walk latent graph), GM-SMCC-3 (prediction similarity graph), and GM-SMCC-mean
(a single-graph model averaging the weighting values of E(1), E(2) and E(3)), with respect to different percentages of labeled data (%) for problem (2) of
KDD Cup 2001.

Figure 5. Same as Figure 4, but for the RankingLoss evaluation metric.

Convergence study

The objective function in Eq. (4) is optimized for classification prediction. Here, we investigate how fast
the algorithm converges. Figures 6(a) and 6(b) show the convergence curves of the proposed algorithm on problems (1) and (2)
(at a 5% label ratio), respectively. The x-axis is the iteration number in the process of optimizing the objective,
and the y-axis is the successively computed objective value O. We see that the algorithm converges within 10 iterations. The computational
time required for problems (1) and (2) is 10.5 seconds and 10.3 seconds, respectively,
using our MATLAB implementation.

Figure 6. The convergence curves of the proposed method. (a) Convergence curve on problem (1); (b) convergence curve on problem (2).

Parameter sensitivity

In our proposed GM-SMCC method, the regularization parameters λ and η quantify the importance of the network regularizer and the label regularizer in the objective
function (4). These parameters also determine the learning setting. Our framework
performs single-label collective classification when λ ≠ 0 and η = 0, i.e., we solve a single-label learning problem for problem (1). On the other
hand, our framework performs multi-label collective classification
when λ ≠ 0 and η ≠ 0, i.e., we consider the label correlations in the learning process for problem
(2).

We examine the parameter sensitivity of our GM-SMCC approach with respect to
λ by fixing η = 0 and varying λ on problem (1). Figure 7(a) shows the accuracy of GM-SMCC for λ values from 0 to 30 on the protein localization prediction task with a 5% label ratio.
When λ = 0 the accuracy is low, since no network information is used in this case; this also
provides evidence of the advantage of the network regularization in the proposed
method. As λ becomes large, the accuracy increases. The plateau in the accuracy curve from 1 to
30 shows that the proposed GM-SMCC achieves fairly stable performance for different
values of λ, implying that the method is robust to the choice of λ. We find that GM-SMCC gives good classification performance at λ = 3.

Figure 7. The performance of the proposed method with different parameter values on the KDD
Cup 2001 dataset. (a) The accuracy for different λ values on problem (1); (b) the Coverage for different η values on problem (2).

Next, we fix λ = 3 and vary η from 0 to 0.4 on problem (2) using a 5% label ratio. The result is given in Figure 7(b). We observe that the performance is poor when η = 0 or η = 0.4, and the smallest Coverage is achieved at η = 0.1. Therefore, we set λ = 3 and η = 0.1 in all the comparisons.

Interaction relations

Our proposed method using the objective function in Eq. (4) is capable of characterizing
the interaction relations among the genes that code for proteins; these proteins tend
to localize in various parts of the cell in order to perform crucial functions. We construct
an extended graph dataset for the KDD Cup 2001 data, consisting of the known interactions among the proteins and the feature set of the proteins. Each protein/gene is described by an extended feature vector that integrates its attribute features, localization
label features and function label features, where x_i is the attribute vector of the i-th instance. Given a new instance, the interaction between it and an existing instance is estimated by the cosine similarity between their conditional probability vectors
obtained from the proposed method. The resulting similarity ranges from 0 to 1, with
0 indicating that two instances are independent and 1 indicating that two instances are highly
interrelated. We apply this cosine similarity measure to evaluate the interaction relations
of 5 randomly selected genes (G238510, G234935, G235158, G237021, G234980) with the other
genes in the KDD Cup 2001 dataset. Table 2 shows interesting interrelations discovered by previous studies with respect
to the evaluated genes. In general, we can see that these interrelated genes tend
to have large similarity values. This shows the advantage of using our proposed method
to detect interactions; biologists can use the method to identify related genes
and to further investigate their interactions.
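
A small illustrative snippet (hypothetical Python) for the similarity score used here:

```python
import numpy as np

def interaction_score(p_u, p_v):
    """Cosine similarity between the conditional probability vectors of two instances;
    for nonnegative vectors the score lies in [0, 1], with larger values indicating
    more strongly interrelated genes/proteins."""
    return float(np.dot(p_u, p_v) /
                 (np.linalg.norm(p_u) * np.linalg.norm(p_v) + 1e-12))
```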

Table 2. Selected interrelated genes and their similarity computed by the proposed GM-SMCC
method.