Identifying essential proteins from active PPI networks constructed with dynamic gene expression

In this section, we first construct an active PPI network based on dynamic gene expression
profiles and static PPI network. Then, we identify essential proteins based on the
constructed active PPI network.

Time-dependent model and Time-independent model

Let x = {x1,…, xm,…, xM} be a time series of observation values at equally-spaced time points from a dynamic
system. Wu et al. 23] have adopted AR (autoregressive) model to analyze the time dependence of time-course
(dynamic) gene expression profiles. In 26], the time-dependent relationships can be modeled by an AR model of order p, denoted by AR(p), as follow:

(1)

where ?i (i = 0, 1,…, p) are the autoregressive coefficients, and ?m(m = p + 1,…, M ) represent random errors, which independently and identically follow a normal distribution
with the mean of 0 and the variance of ?2. The system of Model (1) can be rewritten in the matrix form as:

(2)

where

The likelihood function for Model (2) is

(3)

If the rank (X) = p + 1 holds, the maximum likelihood estimates of ? and ?2 are

(4)

and

The value of the maximum likelihood is given by

(6)

In Model (2), the matrix X has p + 1 columns and M ? p rows. Thus a necessary condition for rank(X) = p + 1 is M ? p ? p + 1 or p ? (M ? 1)/2.

On the other hand, the time-independent model is also an autoregressive model with
the order of zero. That is a noisy profile can be modeled by

(7)

where ?0 is a constant number and ?m(m = p,…,M ) are the random errors which are subject to a normal distribution independent of
time with the mean of 0 and the variance of . The likelihood function for Model (7) is

(8)

The maximum likelihood estimates of ?0 and are

(9)

and

(10)

respectively. The maximum values of the likelihood is given by

(11)

where is a (p + 1) dimensional vector whose first component is and others are zeros.

The likelihood ratio of Model (7) to Model (1) is given by

(12)

According to the likelihood principle, if ? in Formula (12) is too small, the series
x = {x1,…, xm,…, xM} is more likely time-dependent than time-independent. The statistic

(13)

follows an F distribution with (p, M ? 2p ? 1) degrees of freedom when Model (7) is true for a series of observations. When
F is very large, thus the p-value is very small, Model (7) is rejected, i.e., observation series x = {x1,…, xm,…, xM} is time-dependent. From Formula (13), one can calculate the probability (p-value) that a series of observations is not time-independent. As the regression degree
in Model (1) is unknown, the p-values are calculated by Formula (13) for all possible orders p (1 ? p ? (M ? 1)/2). The proposed method calls a gene to be significantly expressed (time-dependent)
if one of these p-values calculated from its expression profile is smaller than a user-preset threshold
value.

Construction of the active protein interaction network

Tang et al. 24] use a potential threshold to filter noisy gene expression data, then construct an
active PPI network. In their method the common value of a threshold is applied to
all the genes and time points. Wang et al. 25] propose a method to identify active time points for each protein in a cellular process
or cycle using a 3-sigma principle to compute an active threshold for each gene according
to the characteristics of its expression curve, then construct an active PPI network.
We first filter noisy genes based on time-dependent model and time-independent model,
time-dependent genes expression data is more likely dynamically deterministic than
random while time-independent genes expression data is more likely random than dynamically
deterministic. Those gene expression data are considered to be noises if they are
time-independent and their means are very small. We then use a threshold function
to compute an active threshold for each gene according to their expression data. We
finally construct an active PPI network (NF-APIN) 26]. Our threshold function is described as follows:

(14)

(15)

For each gene, u and ? are the mean and standard deviation of its expression values. The Active threshold is calculated by Formula (14) for all possible values k(0 ? k ? 3). In this paper the value of coefficient k is selected as 2.5. If the expression level of a gene is over its active threshold
at a time point, the corresponding protein is regarded as active at the time point.
For each time point, if two proteins interacted with each other in the static PPI
network are active at the same time point, the proteins and their interaction form
a part of NF-APIN at the time point. The process is repeated until the NF-APIN is
created.

Centrality measures

A PPI network is usually regarded as an undirected graph G = (V, E), where a node v ? V represents a protein and an edge e(u, v) ? E denotes an interaction between two proteins v and u. In our paper, we have described the active PPI network constructed by our strategy
as G’ = (V’, E’), a node v ? V’ represents a protein and an edge e(u, v) ? E’ denotes an interaction between two proteins v and u. We assign N as the total number of nodes in the network. In graph theory and network analysis,
centrality of a vertex measures its relative importance within a graph. At the present,
six classical centrality measures based on network topology are defined as follows:

Degree Centrality (DC). The degree centrality of a vertex v is defined as

(16)

Where deg(v) is degree of vertex v.

Betweenness Centrality (BC). The betweenness centrality of a vertex v is defined as the fraction of shortest paths that pass through the node v.

(17)

Where ?st is the total number of shortest paths from node s to node t, ?st(v) is the number of those paths that pass through v.

Closeness Centrality (CC). The closeness centrality of a vertex v is the reciprocal of the sum of graph-theoretic distances from the node v to all other
nodes in the graph G.

(18)

Where d(u, v) is a natural distance between all pairs of nodes, defined by the length of their
shortest paths.

Subgraph Centrality (SC). The subgraph centrality of a vertex i is the total number of closed walks in which v takes part and gives more weight to
closed walks of short lengths.

(19)

where µk(i) is the number of closed walks of length l starting and ending at protein i, v1, v2,…vN is an orthonormal basis of RN composed by eigenvectors of the adjacency matrix A of the network and ?1, ?2,…?N are the corresponding eigenvalues. where denotes the ith component of vj.

Local Average Connectivity Centrality (LAC). The local average connectivity of a node
v (LAC(v)) is defined as the average local connectivity of its neighbors:

(20)

where Nv is the set of neighbors of node v, Cv is the subgraph G[Nv] besides Nv. For a node w in Cv, deg(w) is its degree.

Edge Clustering Coefficient (NC) 11]. The edge clustering coefficient of Eu,v can be defined by the following expression:

(21)

Where Zu,v denotes the number of triangles that include the edge actually in the network, du and du are degrees of nodes u and v, respectively.