Clustering PPI data by combining FA and SHC method

The SHC algorithm

The phenomenon of synchronization often appears in physics, it can be expressed as
follows. Two or more dynamic systems both have their own evolution and mutual coupling.
This effect can be either one-way or two-way streets. When meets certain conditions,
the output of these systems will eventually converge and completely be equal under
the influence of coupling, this process is called synchronization. The Kuramotom model
36,37] is applied widely as the simple model of synchronization behavior, the generalized
definition of Kuramotom model is shown as follows:

Definition 1 (Generalize Kuramoto model): The Kuramoto model consists of a population of N coupled phase oscillators ?i(t) whose dynamics are governed by:

(1)

where ?i is its natural frequencies and is distributed with a given probability density g(?).

Each oscillator tries to run independently at its own frequency, while the coupling
tends to synchronize it to all the others.

The sync algorithm is a novel approach for clustering inspired by the powerful concept
of synchronization. It regards each data object as a phase oscillator, and each dimension
coordinates corresponding to a phase value of the oscillator. Each object couples
with data objects in its ?-neighborhood, where ? is the neighborhood radius. In the effect of synchronization coupling, the object’s
coordinates are transformed constantly, and objects with the same coordinates will
be classified eventually to the same cluster, namely synchronization completion. Let
x ? Rd represents an object in the dataset X and xi be the i-th dimension of the object x. The transformation formula of coordinate of x shows as follows.

(2)

where ?-neighborhood is defined in Definition 2 below.

Definition 2 (?-neighborhood): The ?-neighborhood radius of an object is a collection of data with distances to the object
less than ?:

(3)

where dist(x,y) is the metric function of distance and the Euclidean distance is often used. If
the object y ? N?(x), y is called the ?-neighborhood of x, denoted by x ?? y. The relationship of ?-neighborhood between objects is symmetrical, namely if x ?? y then y ?? x.

For reducing the dynamic interaction time of synchronization of data in the sync algorithm,
the concept of neighborhood closures is proposed in SHC algorithm. Objects in a ?-neighborhood closure will reach synchronization complete eventually. So it can detect
the clusters even if the objects have not yet reached the same coordinates by classifying
data in the same neighborhood closures to the same cluster, which reduces the dynamic
interaction time of data.

Definition 3 (?-neighborhood closures): Suppose objects set X’ ? X, in the dynamic process of synchronous clustering, if ?x, y ? X’ satisfies x ?? y, and if ?x ? X, x ?? z, then z ? X’, X’ is called an ?-neighborhood closure, that is, for any object x ? X’, N?(x) = X’ is established.

a1, a2, a3, a4 form a ?-neighborhood closure in the Figure 1, and will reach complete synchronization eventually.

Figure 1. ?-neighborhood closures.

The optimal value of synchronous neighborhood radius needs to be determined in both
the sync algorithm and the SHC algorithm. The SHC algorithm determines synchronous
neighborhood radius by means of the hierarchical search that the sync algorithm does.
The process of hierarchical search for the optimization of the neighborhood radius
shows as follows. Starting in a small neighborhood radius value ?, then adding an increment (marked as ??) to ? at a time (? = ? + ??) until the neighborhood radius is large enough to contain all objects. Clustering
in each neighborhood radius of ?, and it is considered to be optimal when the ? gets the best result of clustering.

The FA

The FA is a random optimization algorithm constructed by simulating the group behavior
of the fireflies. There are two important elements in the FA, the light intensity
and the attractiveness. The former reflects the advantages and disadvantages of locations
of fireflies and the latter determines the movement distances of fireflies attracted.
The optimization process of the algorithm is implemented through updating the light
intensity and the attractiveness constantly. The mathematical mechanism of the FA
is described as follows.

The relative value of the light intensity of fireflies is expressed as:

(4)

where I0 is the initial light intensity (r = 0) related to the objective function value, the higher the value of objective function
is, the stronger the initial light intensity I0 will be. ? is the light absorption coefficient set to reflect the features that the light intensity
decreases gradually along with the increase of the distance and the absorption of
the medium. It can be set to a constant. rij is the space distance between firefly i and firefly j.

The attractiveness of firefly is expressed as:

(5)

where ?0 is the maximum of attractiveness. ? and rij are the same as above.

If firefly i moves to firefly j, the updating of location of firefly i is expressed as:

(6)

where xi(t), xj(t) are the space coordinates of firefly i and firefly j at the time t, ? is step-size in [0, 1], rand is a random factor that follows uniform distribution in [0, 1].

Fireflies are distributed to the solution space randomly first of all. Each firefly
has its own light intensity according to its location, the light intensity is calculated
according to Eq. (4). The firefly with low light intensity is attracted by and moving
to the firefly with higher light intensity. The movement distance depends on the attractiveness
between them calculated by Eq. (5). The location updating of the fireflies is cumulated
based on Eq. (6). There is a disturbing term in the process of updating the location,
which enlarges the search area and avoids the algorithm to fall into the local optimum
too early. Finally all fireflies will gather in the location of the maximum light
intensity.

The proposed clustering algorithm

The sync algorithm clusters objects based on the principle of dynamic synchronization,
which has many advantages in that it reflects the intrinsic structure of the dataset.
For example, it can detect clusters of arbitrary number, shape and size and not depend
on any assumption of distribution. In addition, it can handle outliers since the noise
will not synchronize to cluster objects. However, the running time of the algorithm
consists of two parts primarily: The dynamic interaction time of synchronization of
data and the process of determining the optimal value of synchronous neighborhood
radius, which is too long to process large-scale data.

Aiming to reduce the dynamic interaction time of the sync algorithm, the concept of
?-neighborhood closures is proposed in the SHC algorithm. It classifies objects in
the same neighborhood closures to a cluster even if objects have not yet reached the
same coordinate, which enhances the efficiency of the algorithm by reducing the time
of dynamic interaction of data. However, the SHC algorithm determines synchronous
neighborhood radius by means of hierarchical search that the sync algorithm does.
The hierarchical search for synchronous neighborhood radius not only has low efficiency
but also has two shortcomings. Firstly, the hierarchical search is very difficult
to find the optimal value of synchronous neighborhood radius in a fixed increment.
Secondly, the increment ?? needs to be adjusted according to different data distributions. For example, in the
SHC algorithm, the initial value of ? is set to the average distance of all objects of its three nearest neighbors. The
increment ?? is the different value of the average distance of all objects to its four nearest
neighbors minus the average distance of all objects to its three nearest neighbors.
So the running time of the SHC algorithm is very huge when the dataset is uniform
and dispersive. In addition, we must set ?? small when the data distribution is approximate, otherwise it is hard to find the
optimal value of synchronous neighborhood radius.

The FA is a swarm intelligent optimization algorithm developed by simulating the glowing
characteristics of fireflies, which is speedy and precise in the optimization process.
Using the firefly algorithm to search for the optimal neighborhood radius of synchronous
can overcome the drawbacks of the hierarchical search. It adopts fewer searching steps
for the optimal value of synchronous neighborhood radius and gets more accurate results
than the hierarchical search due to its intelligent searching strategies. So it saves
time on determining the optimal value of synchronous neighborhood radius. In addition,
it is applicable to any data distributions. So we improve the SHC algorithm by means
of the FA and apply the proposed algorithm to clustering PPI data.

Preprocessing of PPI data

The PPI data is expressed as a graph, called PPI network, in which each node represents
a protein and the edge between two nodes represent the interaction between proteins.
In that way, we get an n*n adjacency matrix of nodes. However, the dimension of the adjacency matrix is too big
to deal with. Inspired by the spectral clustering, we use the following way to reduce
the dimension of the adjacency matrix of PPI.

First, a similarity matrix A of nodes is constructed as follow.

(7)

where Ni, Nj are neighbor nodes of nodes u and v respectively. Iij is the common neighbors of i and j, w(i,j) is the weight between i and j to measure the interaction strength, and ? is constant between 0 and 1.

Eq. (7) considers two aspects of the aggregation coefficient of edges and the weighted
aggregation coefficient of edges 38-40]. The first half of Eq. (7) is the aggregation coefficient of edges based on degree,
which is portrayed by means of the ratio between adding 1 to the number of common
neighbors of two protein nodes and minimal value of the number of neighbors of two
nodes. The second half of Eq. (7) is the weighted aggregation coefficient of edges,
which is illustrated by the ratio between the product of summation of weight values
of edges respectively connecting these two nodes (i, j) with their common neighbors (k) and the product of summation of weight values of edges linking these two nodes (i, j) with their corresponding neighbors (s, t). In addition, we use ? to balance the weight of the two parts.

Then constructing Laplacian matrix L of matrix A, the D is the diagonal matrix in which (i, i)-element is the sum of A‘s i-th row.

(8)

Matrix X consists of eigenvectors of matrix L‘s corresponding to the first three eigenvalues and X is normalized. X is an n*3 matrix, in which lines represent the protein objects (corresponding to the protein
nodes in PPI network) and columns are the three-dimensional space coordinates of the
protein objects. Our proposed clustering algorithm is calculated based on X.

Design of solution space

The solution space of the position of the firefly corresponds to the neighborhood
radius of synchronization. The initial light intensity I0 of one firefly is assigned by the calculation result of objective function, see Eq.(9),
which is expressed as the evaluation of clustering results based on the neighborhood
radius of the firefly. Moving to the firefly with higher light intensity is regarded
as to search for the optimal value of synchronous neighborhood radius. The position
of the firefly with the highest light intensity means the optimal value of synchronous
neighborhood radius.

Definition of objective function

We choose the following object function to evaluate the clustering results. Clusters
with higher value of the objective function mean the stronger modularity of clusters,
namely, a better clustering result.

(9)

Where mH is the number of edges that connect points in the cluster Hi , nH is the number of edges that connect points in the cluster Hi with points out of the cluster Hi, w(u,v) is the weight between point u and point v, x is the number of clusters, W is the set of connections.

The first half of the Eq. (9) is the summation of the ratio of its in-degree to the
sum of its in-degree and its out-degree, the second part is the summation of the ratio
of its weighted in-degree to the sum of its weighted in-degree and its weighted out-degree.
The two parts calculate modularity respectively. We can change the proportion of two
parts by adjusting the parameter ?.

Flow chart of the algorithm

Figure 2 is the flow chart of the improved synchronization-based hierarchical clustering algorithm.

Figure 2. Flow chart of the improved SHC algorithm.

The detailed procedures of the improved SHC algorithm are as follows.

Step 1 Construct a similarity matrix A of protein objects, and then get Laplacian matrix L of the matrix A. Matrix X consist of the matrix L‘s eigenvector that the top three eigenvalues corresponded. X is an n*3 matrix, in which the rows represent protein objects and the columns are the three-dimensional
space coordinates of protein objects.

Step 2 The setting of parameters: the number of firefly N, the maximum of attractiveness ?0, the light absorption coefficient ?, step-size ?, Maximum iterations maxiter, iter = 0.

Step 3 Initialize the location of firefly in the solution space of the neighborhood radius
? of synchronization.

Step 4 Do clustering respectively based on the synchronous in ? that each firefly corresponding.

Step 4.1 Find ?-neighborhood closures of protein objects of matrix X. Objects that belong to the same closures are divided into a cluster, and then mark
those objects.

Step 4.2 If all points are marked, return to the result of clustering, otherwise the unmarked
objects couple with the objects in its ?-neighborhood according to the formula (2), and then go to step 4.1.

Step 5 The light intensity of fireflies are assigned by the calculation result of the objective
function (9) according to the clustering result. Compare the brightness of fireflies,
if Ii Ii, calculate the attractiveness according formula (5), and then update the location
of firefly i according to the formula (6).

Step 6 iter = iter+1;

Step 7 If iter = maxiter, go to Step 4, otherwise output the clustering result that the firefly with the highest light intensity.

The time complexity of algorithm

The time complexity of the SHC algorithm is O(T·n^2) in a certain neighborhood radius, where n is the number of nodes, T is the number of synchronization to form ?-neighborhood closures. Assuming that the number of dynamic interaction of synchronization
of data to form ?-neighborhood closures will not change in different neighborhood radiuses, the time
complexity of the SHC algorithm is O(k·T·n^2), k is the number of iterations of searching for the optimal ?-neighborhood radius based on the hierarchical search. Replacing the hierarchical
search with FA results in the decrease of k, it enhances the efficiency of the algorithm.