Towards smart energy systems: application of kernel machine regression for medium term electricity load forecasting

Kernel machines

Analytical models that can be expressed as a function of a kernel are known as kernel
machines (Bishop 2006). A kernel is any valid mathematical function that can be written in
the dual representation. The general form of the dual representation is given by:

$k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{T} \boldsymbol{\phi}(\mathbf{x}')$    (1)

where φ(x) is any analytical function known as a basis function, and k(x, x′) represents a kernel function. In general, reformulating a model in terms of Eq. (1) is known as the kernel trick. Two common examples of kernel functions are the
linear and the polynomial kernels, whose analytical formulas are given respectively
by (Bishop 2006):

$k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{T} \mathbf{x}'$    (2)

$k(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^{T} \mathbf{x}' + c\right)^{d}$    (3)

Beyond the widely known kernels, new valid kernels may be created by composing two or
more valid kernels through the operations of addition and/or multiplication
(Rasmussen 2006). The selection of an appropriate kernel function is a key design choice that must
generally be made according to the specifications of the problem at hand.
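As a brief illustration, the following Python sketch implements the linear and polynomial kernels of Eqs. (2) and (3) and combines them into a new valid kernel by addition and multiplication; the function names and the constants c and d are illustrative assumptions rather than choices made in this work.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel of Eq. (2): k(x, x') = x^T x'."""
    return np.dot(x, z)

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel of Eq. (3): k(x, x') = (x^T x' + c)^d (c, d assumed)."""
    return (np.dot(x, z) + c) ** d

def composite_kernel(x, z):
    """A new valid kernel built by adding and multiplying valid kernels."""
    return linear_kernel(x, z) + linear_kernel(x, z) * polynomial_kernel(x, z)

# Example evaluation on two input vectors
x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), composite_kernel(x1, x2))
```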

Gaussian process regression

A Gaussian process is defined as a collection of random variables, any finite subset of which
has a joint Gaussian distribution. A Gaussian process is fully determined by its mean function m(x) and covariance function C(x, x′), and therefore it takes the form:

$y(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}),\, C(\mathbf{x}, \mathbf{x}')\right)$    (4)

where it is common to assume for convenience that m(x) = 0.
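To illustrate Eq. (4) and the zero-mean convention, the sketch below draws sample functions from a zero-mean Gaussian process prior over a grid of inputs; the squared-exponential covariance used here is only an assumed example of a valid covariance function, not one prescribed by this work.

```python
import numpy as np

def squared_exponential(x, z, length_scale=1.0):
    """Assumed covariance function C(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * ((x - z) / length_scale) ** 2)

# Grid of scalar inputs and the corresponding covariance matrix
xs = np.linspace(0.0, 10.0, 50)
C = np.array([[squared_exponential(xi, xj) for xj in xs] for xi in xs])

# Draw three sample functions from the zero-mean GP prior of Eq. (4);
# a small jitter is added to the diagonal for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(xs)),
                                  cov=C + 1e-8 * np.eye(len(xs)),
                                  size=3)
print(samples.shape)  # (3, 50)
```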

Gaussian processes are applied to regression problems, i.e. problems of predicting
continuous target variables. The derivation of Gaussian process regression (GPR)
starts from the simple linear regression model:

$y(\mathbf{x}) = \sum_{i} w_{i}\, \phi_{i}(\mathbf{x})$    (5)

where w_i are the regression weights and φ_i are the basis functions. Equation (5) may be written in vector form as given below:

$\mathbf{y} = \boldsymbol{\Phi}\, \mathbf{w}$    (6)

where Φ is the design matrix with elements Φ_{ni} = φ_i(x_n) and w is the vector of weights.

Next, a prior normal distribution over the model weights is adopted:

$p(\mathbf{w}) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\right)$    (7)

where 0 represents the mean vector, α⁻¹ is the variance shared by all individual weights, and I is the identity matrix. Therefore, the distribution over the output vector y is also normal:

$p(\mathbf{y}) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{0},\, \mathbf{K}\right), \quad \mathbf{K} = \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{T}$    (8)

Regression problems require taking into account noisy observed target values. If ε_n
denotes additive noise with zero mean and variance σ², then the target values become:

$t_{n} = y_{n} + \epsilon_{n}$    (9)

Hence, the distribution over the target variables is also normal:

$p(\mathbf{t}) = \mathcal{N}\left(\mathbf{t} \mid \mathbf{0},\, \mathbf{C}_{N}\right), \quad \mathbf{C}_{N} = \mathbf{K} + \sigma^{2}\mathbf{I}$    (10)

In Gaussian process regression the Bayesian formalism is applied in order to infer
a predictive distribution, i.e. a mean value and the associated variance. The prediction
of the target t_{N+1} for an unknown input x_{N+1} is based on the previously observed targets t_N
and the respective inputs x_1, …, x_N, and thus the predictive distribution becomes:

$p\left(t_{N+1} \mid \mathbf{t}_{N}\right) \propto \exp\left(-\tfrac{1}{2}\, \mathbf{t}_{N+1}^{T}\, \mathbf{C}_{N+1}^{-1}\, \mathbf{t}_{N+1}\right)$    (11)

with t_{N+1} = (t_1, …, t_N, t_{N+1})ᵀ denoting the vector of all N + 1 targets. It is apparent that the predictive distribution depends on the inverse of the
covariance matrix C_{N+1}. In order to ease computation of the predictive distribution parameters, the covariance
matrix C_{N+1} is subdivided into four submatrices (Williams 2002):

$\mathbf{C}_{N+1} = \begin{pmatrix} \mathbf{C}_{N} & \mathbf{k} \\ \mathbf{k}^{T} & \kappa \end{pmatrix}$    (12)

where C_N is the covariance matrix of the N observations, k is a vector of length N containing the covariances between point N + 1 and each of the N observed points, and κ is the scalar variance of point N + 1. Thus, it can be shown (MacKay 1998) that the parameters of the normal predictive distribution, i.e. the mean and the
variance at point N + 1, are given respectively by the following formulas:

$m(\mathbf{x}_{N+1}) = \mathbf{k}^{T}\, \mathbf{C}_{N}^{-1}\, \mathbf{t}_{N}$    (13)

$\sigma^{2}(\mathbf{x}_{N+1}) = \kappa - \mathbf{k}^{T}\, \mathbf{C}_{N}^{-1}\, \mathbf{k}$    (14)

where it is noted that both equations depend on the covariance matrix C_N rather than C_{N+1}.
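The prediction step of Eqs. (12) to (14) can be summarized in a short NumPy sketch, shown below under the assumption of a generic kernel function and a known noise variance; all names (gpr_predict, noise_var, and so on) are illustrative rather than taken from this work.

```python
import numpy as np

def gpr_predict(X, t, x_new, kernel, noise_var):
    """Predictive mean (Eq. 13) and variance (Eq. 14) for a new input x_new."""
    N = len(X)
    # Covariance matrix C_N of the N noisy observations, as in Eq. (10)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    C_N = K + noise_var * np.eye(N)
    # Covariances between the new point and the N observed points (vector k of Eq. 12)
    k = np.array([kernel(x_new, xi) for xi in X])
    # Prior variance of the new point (scalar kappa of Eq. 12), including the noise term
    kappa = kernel(x_new, x_new) + noise_var
    alpha = np.linalg.solve(C_N, t)                # C_N^{-1} t
    mean = k @ alpha                               # Eq. (13)
    var = kappa - k @ np.linalg.solve(C_N, k)      # Eq. (14)
    return mean, var

# Example call using the polynomial kernel of Eq. (3)
X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
t = np.array([0.1, 0.9, 2.1])
poly = lambda x, z: (np.dot(x, z) + 1.0) ** 2
print(gpr_predict(X, t, np.array([1.5]), poly, noise_var=0.01))
```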

Relevance vector regression

In the current manuscript we consider the regression form of relevance vector machines,
known as relevance vector regression (RVR). To derive RVR, we initially
assume that the target variable t, given an input x, follows a normal distribution:

$p\left(t \mid \mathbf{x}, \mathbf{w}, \sigma^{2}\right) = \mathcal{N}\left(t \mid y(\mathbf{x}),\, \sigma^{2}\right)$    (15)

where σ² is the variance of the data noise, while the mean value y(x) is given by:

$y(\mathbf{x}) = \sum_{i=1}^{M} w_{i}\, \phi_{i}(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})$    (16)

where φ_i(x) is a valid function called a basis function, M is the number of basis functions, and w is the weight vector. By expressing Eq. (16) in terms of kernel functions, the RVR model becomes:

$y(\mathbf{x}) = \sum_{n=1}^{N} w_{n}\, k(\mathbf{x}, \mathbf{x}_{n}) + b$    (17)

where b is the bias term and N is the number of known observations (i.e., the size of the training dataset).
Next, we collect the N input observations into a single matrix X and the respective N targets into a vector t. Thus, we obtain the likelihood function:

$p\left(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^{2}\right) = \prod_{n=1}^{N} \mathcal{N}\left(t_{n} \mid y(\mathbf{x}_{n}),\, \sigma^{2}\right)$    (18)

and a prior distribution over the weight vector w:

$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}\left(w_{i} \mid 0,\, \alpha_{i}^{-1}\right)$    (19)

where α_n is the precision (inverse variance) of weight w_n and M is equal to N + 1. Plugging Eqs. (18) and (19) into the Bayes formula, we obtain the posterior distribution over w:

$p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \sigma^{2}\right) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{m},\, \boldsymbol{\Sigma}\right)$    (20)

where the mean is given by:

$\mathbf{m} = \sigma^{-2}\, \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^{T}\, \mathbf{t}$    (21)

and the corresponding covariance by:

$\boldsymbol{\Sigma} = \left(\sigma^{-2}\, \boldsymbol{\Phi}^{T}\boldsymbol{\Phi} + \mathbf{A}\right)^{-1}$    (22)

where A = diag(α_i) and Φ = K, with K being the (N + 1) × (N + 1) dimensional matrix whose elements are given by the kernel function k(x_n, x_m).
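Under these definitions, the posterior statistics of Eqs. (21) and (22) can be computed directly once the design matrix Φ is assembled from the kernel evaluations and a bias column; the sketch below does so for fixed hyperparameters α and σ², with illustrative names that are assumptions rather than notation taken from this work.

```python
import numpy as np

def rvr_posterior(X, t, kernel, alpha, sigma2):
    """Posterior mean m (Eq. 21) and covariance Sigma (Eq. 22) of the RVR weights.

    X      : list of N input vectors
    t      : array of N targets
    alpha  : array of M = N + 1 weight precisions (A = diag(alpha))
    sigma2 : noise variance
    """
    N = len(X)
    # Design matrix Phi: one column per kernel k(., x_m) plus a bias column, cf. Eq. (17)
    Phi = np.ones((N, N + 1))
    for n in range(N):
        for m in range(N):
            Phi[n, m] = kernel(X[n], X[m])
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)   # Eq. (22)
    m = Sigma @ Phi.T @ t / sigma2                    # Eq. (21)
    return m, Sigma
```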

At this point it should be noted that the unknown parameters α_i and σ² are evaluated by maximizing the logarithmic marginal likelihood:

$\ln p\left(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \sigma^{2}\right) = -\frac{1}{2}\left[\, N\ln(2\pi) + \ln\left|\mathbf{C}\right| + \mathbf{t}^{T}\mathbf{C}^{-1}\mathbf{t} \,\right]$    (23)

where t = (t_1, …, t_N)ᵀ and C is an N × N matrix given by:

$\mathbf{C} = \sigma^{2}\mathbf{I} + \boldsymbol{\Phi}\, \mathbf{A}^{-1}\boldsymbol{\Phi}^{T}$    (24)

where I is the identity matrix.
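A minimal sketch of evaluating Eq. (23), with C constructed as in Eq. (24), is given below; the iterative maximization over α and σ² itself is not reproduced here, and the function name and arguments are illustrative assumptions.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """Logarithmic marginal likelihood of Eq. (23), with C given by Eq. (24)."""
    N = len(t)
    C = sigma2 * np.eye(N) + Phi @ np.diag(1.0 / alpha) @ Phi.T   # Eq. (24)
    sign, logdet = np.linalg.slogdet(C)                            # ln|C|
    quad = t @ np.linalg.solve(C, t)                               # t^T C^{-1} t
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)        # Eq. (23)
```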

Maximization of the marginal likelihood in Eq. (23) with an appropriate iterative method yields the optimal values
of its parameters, denoted α* and (σ²)*. Some of the elements of the vector α* are driven to infinity, and the posterior distribution of the corresponding weights therefore collapses to zero,
with both mean and variance equal to zero. As a result, the corresponding kernel
functions make no contribution to the prediction, and the output depends exclusively
on the kernels with non-zero weights. The inputs associated with these non-zero weighted kernels are called relevance vectors.

Therefore, RVR provides a predictive distribution over the target value t of a new input x:

$p\left(t \mid \mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^{*}, (\sigma^{2})^{*}\right) = \mathcal{N}\left(t \mid m(\mathbf{x}),\, \sigma^{2}(\mathbf{x})\right)$    (25)

with the mean given by:

$m(\mathbf{x}) = \mathbf{m}^{T}\, \boldsymbol{\phi}(\mathbf{x})$    (26)

and variance by:

$\sigma^{2}(\mathbf{x}) = (\sigma^{2})^{*} + \boldsymbol{\phi}(\mathbf{x})^{T}\, \boldsymbol{\Sigma}\, \boldsymbol{\phi}(\mathbf{x})$    (27)

where φ(x) is the vector of basis functions, whose elements are non-zero for the relevance vectors and zero
for the rest.
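Finally, given the optimized hyperparameters and the posterior statistics of Eqs. (21) and (22) restricted to the retained (relevance) basis functions, the predictive mean and variance of Eqs. (26) and (27) follow directly, as in the sketch below; the argument names are assumptions introduced only for illustration.

```python
import numpy as np

def rvr_predict(x_new, rel_vectors, m_rel, Sigma_rel, kernel, sigma2_opt, bias_kept=True):
    """Predictive mean (Eq. 26) and variance (Eq. 27) of RVR for a new input.

    rel_vectors : the relevance vectors (inputs whose weights were not pruned)
    m_rel       : posterior mean of the retained weights
    Sigma_rel   : posterior covariance of the retained weights
    sigma2_opt  : optimized noise variance (sigma^2)*
    """
    # Basis vector phi(x): kernels against the relevance vectors (+ bias if retained)
    phi = np.array([kernel(x_new, xr) for xr in rel_vectors])
    if bias_kept:
        phi = np.append(phi, 1.0)
    mean = m_rel @ phi                           # Eq. (26)
    var = sigma2_opt + phi @ Sigma_rel @ phi     # Eq. (27)
    return mean, var
```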