A novel channel selection method for optimal classification in different motor imagery BCI paradigms

Experiment and data description

In this study, three datasets were used. The specific information about each dataset
and its corresponding experimental paradigm is described as follows:

1.
Dataset 1 from an MI task paradigm. The dataset was made available by Dr. Allen Osman of the
University of Pennsylvania [30] for a data analysis competition held during the Neural
Information Processing Systems conference (NIPS 2001) [31]. In this experimental paradigm,
subjects were asked to imagine left- or right-hand movement once the letter “L” or “R”
was shown on a computer screen. Subjects then performed sustained hand movement imagery
for a fixed period of time. The specific experimental process is illustrated in Fig. 1a.
Data from eight subjects were used in this study, and the scalp EEG was recorded
from 59 channels (international 10–20 system) with a sampling rate of 100 Hz. For
each subject, the total number of trials is 180: 90 trials of left-hand imagery and
90 of right-hand imagery. Each trial has a fixed length of 2.25 s.

Fig. 1. Illustration of three different MI paradigms. This diagram provides a general
overview of the differences among the three MI paradigms. a MI task paradigm; b two-class control paradigm (also called 1-D cursor control); c four-class control paradigm (also called 2-D cursor control). Each paradigm comprises three
experimental stages: a rest stage, a preparation stage, and an execution stage.
In the execution stage of paradigms b and c, the eyes are open and real-time visual feedback is provided, whereas in the execution stage of paradigm
a, the eyes are closed and no visual feedback is given

2.
Dataset 2 from a two-class control paradigm. The experiments were conducted in the Biomedical
Functional Imaging and Neuroengineering Laboratory at the University of Minnesota
according to a protocol approved by the Institutional Review Board of the University
of Minnesota [13]. In this experimental paradigm (Fig. 1b), the screen was blank at the
beginning. Two seconds later, a target appeared at one of two locations on the periphery
of the screen. At 5 s, the cursor appeared in the center of the screen, and subjects
could move it horizontally through imagination of left- or right-hand movement. Provided
with visual feedback of the cursor position, subjects could adjust their imagination
patterns in real time to control the cursor movement; this is the major difference
from the MI task paradigm. A trial ended when the cursor reached the correct target bar,
hit the wrong target, or failed to reach any target within 6 s, so each trial lasted
5–11 s. However, not all time points of a trial carry information about the EEG
modulation of motor imagery, so only the execution period of each trial, usually
0.5–3 s, was extracted for analysis. Data from eight subjects were used, and the scalp
EEG was recorded from 62 channels (international 10–20 system) with a sampling rate
of 200 Hz. For each subject, the total number of trials is 180: 90 trials for left
and 90 for right.

3.
Dataset 3 from a four-class control paradigm. The experiments were also conducted in the Biomedical
Functional Imaging and Neuroengineering Laboratory at the University of Minnesota
according to a protocol approved by the Institutional Review Board of the University
of Minnesota. This paradigm is similar to the two-class control paradigm. The only
difference is that the four-class control paradigm controls the movement of a cursor
in four directions, i.e., MI of the left hand, right hand, both hands, and neither
hand moves the cursor left, right, up, and down, respectively. The experimental process
is illustrated in Fig. 1c. Each trial lasts 5–11 s, and the execution period of each
trial, usually 0.5–3 s, was extracted for analysis (as with Dataset 2). Data from
eight subjects were used here, and the scalp EEG was recorded from 62 channels
(international 10–20 system) with a sampling rate of 1000 Hz. For each subject, the
total number of trials is 280: 70 trials for each direction.

Note that there is a slight difference in electrode setups between Dataset 1 and Datasets
2/3 (the electrode setup of Dataset 3 is the same as that of Dataset 2). Three peripheral
electrodes were not recorded during the Dataset 1 experiment; all other electrode
locations were consistent with Dataset 2, so this difference does not affect the results.

Data processing

Dataset 3 had a relatively high sampling rate, so before signal processing we downsampled
its EEG data from 1000 to 200 Hz. We also tested downsampling Datasets 2 and 3 to
100 Hz, for consistency with the sampling rate of Dataset 1, and found that whether
the sampling rates were consistent or not did not affect the results; Datasets 2 and 3
were therefore only downsampled to 200 Hz for the results presented here. The data
processing was then carried out in the following steps: spatial filtering, feature
extraction, and feature normalization.
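
As an illustration, a minimal downsampling sketch using SciPy is given below; the array shape and variable names are assumptions for illustration, not taken from the original pipeline.

```python
# Minimal downsampling sketch (assumed shapes), using SciPy's decimate,
# which low-pass filters before downsampling to avoid aliasing.
import numpy as np
from scipy.signal import decimate

fs_in, fs_out = 1000, 200
eeg = np.random.randn(62, 6 * fs_in)   # hypothetical trial: 62 channels, 6 s at 1000 Hz

factor = fs_in // fs_out                # 5
eeg_ds = decimate(eeg, factor, axis=1, zero_phase=True)
print(eeg_ds.shape)                     # (62, 1200) -> 200 Hz
```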

Surface Laplacian filtering

Raw scalp EEG has a low signal-to-noise ratio, since it is spatially smeared by
the head volume conductor effect. Thus, in order to improve the quality of the EEG, a
surface Laplacian filter [32] was adopted to accentuate localized activity and to reduce
diffusion in the multichannel EEG [33, 34]. The formulation of the surface Laplacian
filter is as follows:

$$V_j^{\mathrm{LAP}} = V_j - \frac{1}{n}\sum_{k \in S_j} V_k \tag{1}$$

where $V_j$ represents the target channel, $V_k$ stands for a neighboring channel, and $S_j$ is the index set of the surrounding channels. The parameter $n$ is the number of neighboring channels; in most cases $n$ is set to 4, whereas $n$ is 2 or 3 when the target channel is at the periphery.
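
For illustration, the following is a minimal sketch of Eq. (1) in Python, assuming a hypothetical neighbor map; the montage-specific neighbor sets actually used in the study are not reproduced here.

```python
# Minimal sketch of the surface Laplacian of Eq. (1); the neighbor map below
# is a toy example, not the paper's montage.
import numpy as np

def surface_laplacian(eeg, neighbors):
    """eeg: (channels, samples); neighbors: dict channel index -> neighbor indices (S_j)."""
    filtered = np.empty_like(eeg)
    for j in range(eeg.shape[0]):
        s_j = neighbors[j]
        # Subtract the mean of the n neighboring channels from the target channel.
        filtered[j] = eeg[j] - eeg[s_j].mean(axis=0)
    return filtered

# Toy 5-channel montage: central channel 2 has four neighbors, peripheral ones fewer.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
eeg_lap = surface_laplacian(np.random.randn(5, 1000), neighbors)
```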

Feature extraction

We focused on the frequency range from 5 to 35 Hz, since many studies have indicated
that this range mainly represents the user's intent in MI-based BCIs. Furthermore,
the event-related desynchronization (ERD) and synchronization (ERS) that occur over
the sensorimotor cortex during motor imagination in the mu band (8–12 Hz) and beta
band (18–26 Hz) [11, 13, 35] also fall within this frequency range. Considering that
the related frequency bands are narrow, we decomposed the entire 5–35 Hz range into
13 partially overlapping sub-bands [36] using a constant-Q scheme, also known as
proportional bandwidth. According to this computing principle, the 13 sub-bands are
5.25–6.75 Hz, 6.0–7.71 Hz, 6.86–8.82 Hz, 7.84–10.08 Hz, 8.96–11.52 Hz, 10.24–13.16 Hz,
11.70–15.04 Hz, 13.37–17.19 Hz, 15.28–19.64 Hz, 17.46–22.45 Hz, 19.96–25.66 Hz,
22.81–29.32 Hz, and 26.07–33.51 Hz. We were mainly concerned with the power changes
in these frequency bands, and therefore extracted the envelope of each sub-band using
the Hilbert transform, given that the envelope quantitatively reflects the instantaneous
power change of a frequency band. In this study, each envelope was extracted from the
data of a single trial, and each feature is the average of one envelope over the trial time.
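
The following sketch illustrates this feature extraction chain for a single channel, using the 13 sub-bands listed above; the Butterworth filter design and its order are our assumptions, since the paper does not specify the band-pass implementation.

```python
# Sketch of the per-channel feature extraction: band-pass filtering into the
# 13 constant-Q sub-bands, Hilbert envelope, and time-averaging.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

SUB_BANDS = [(5.25, 6.75), (6.0, 7.71), (6.86, 8.82), (7.84, 10.08),
             (8.96, 11.52), (10.24, 13.16), (11.70, 15.04), (13.37, 17.19),
             (15.28, 19.64), (17.46, 22.45), (19.96, 25.66), (22.81, 29.32),
             (26.07, 33.51)]

def trial_features(x, fs=200):
    """x: single-channel trial (1-D array). Returns 13 envelope features."""
    feats = []
    for lo, hi in SUB_BANDS:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        xb = filtfilt(b, a, x)              # zero-phase sub-band signal
        env = np.abs(hilbert(xb))           # instantaneous amplitude envelope
        feats.append(env.mean())            # average envelope over trial time
    return np.array(feats)

feat = trial_features(np.random.randn(600), fs=200)   # e.g. a 3-s trial at 200 Hz
print(feat.shape)                                     # (13,)
```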

Feature normalization

Min–max normalization was adopted here to transform feature values into the range
(−1, 1). The normalization was performed according to the formula below.

$$X_{\mathrm{new}} = \frac{(X - Min)(newMax - newMin)}{Max - Min} + newMin \tag{2}$$

where $Min$ and $Max$ are the minimum and maximum values of a feature, the parameters $newMin$ and $newMax$ are the lower and upper bounds of the new range, and $X$ is the feature value to be transformed.
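
A minimal sketch of Eq. (2), applied feature-wise across trials; variable names mirror the equation:

```python
# Min-max normalization of Eq. (2): rescale each feature column into (new_min, new_max).
import numpy as np

def minmax_normalize(X, new_min=-1.0, new_max=1.0):
    """X: (trials, features). Returns X rescaled column-wise."""
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) * (new_max - new_min) / (col_max - col_min) + new_min
```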

Through the above process, we obtain 13 normalized features per trial from a single
channel, corresponding to the aforementioned 13 sub-bands. When all the channels
(59 or 62) are initially used, the total number of features is 767 or 806
(13 × 59 or 13 × 62). During the computing process, features from different channels
are concatenated into a single feature vector representing one trial.

Channel selection

Performance assessment

Channel selection is an optimization problem of choosing the optimal subset of channels
from the full set available. Our ultimate goal is to maximize the classification accuracy
of distinct motor imageries in MI-based BCIs, so “optimal” here means achieving the
best accuracy with the fewest channels. The best accuracy is the optimal testing
accuracy in classification, obtained as the mean testing accuracy over the folds of
a 10-fold cross-validation.

Relief and Relieff

Relief [24] is a widely used method for feature selection in binary classification
problems, owing to its effectiveness and computational simplicity. Relieff [37] is an
extension of it designed for feature selection in multi-class classification problems.
Relief and Relieff are similar in principle, so here we take Relieff as an example to
describe the computation; for technical details about Relief, please refer to [24].
The key idea of Relieff is to estimate the quality of attributes by their ability to
distinguish among samples that are near to each other, as given in Fig. 2a. At the
beginning, the weights of all features are initialized to zero (line 1). The algorithm
then randomly selects a sample R from the training dataset T (line 3), selects the
k nearest samples from the same class (called “Near Hits”, line 4) and the k nearest
samples from each of the other classes (called “Near Misses”, lines 5 and 6), and
updates the feature weight vector W for all attributes according to Eq. (3). The
whole process is repeated m times, where m is a user-defined parameter. From this
process it is easy to see that a large weight means the feature is important for
classification, and a small weight means it is less important. The weight update
equation is as follows:

$$W(f) := W(f) - \sum_{j=1}^{k}\frac{d(f, R, H_j)}{m \cdot k} + \sum_{C \neq class(R)}\frac{p(C)}{1 - p(class(R))}\sum_{j=1}^{k}\frac{d(f, R, M_j(C))}{m \cdot k} \tag{3}$$

where $W(f)$ represents the weight of feature $f$. The parameter $k$ is the number of nearest samples selected, and $m$ is the number of repetitions of the computing process. $C$ represents any class other than $R$'s class, and $p$ is the prior probability of a class. The function $d(f, R, R_o)$ calculates the difference between sample vectors $R$ and $R_o$ at feature $f$, where $R_o$ can be $H_j$ (a Near Hit) or $M_j$ (a Near Miss). For numerical attributes, $d(f, R, R_o)$ is defined as:

$$d(f, R, R_o) = \frac{\lvert R(f) - R_o(f) \rvert}{\max(f) - \min(f)} \tag{4}$$
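
To make Eqs. (3) and (4) concrete, the following is a minimal sketch of a Relieff pass; the Manhattan distance for neighbor search and features pre-scaled to [0, 1] (so that Eq. (4) reduces to an absolute difference) are simplifying assumptions.

```python
# Sketch of Relieff weight estimation per Eqs. (3) and (4).
import numpy as np

def relieff_weights(X, y, m=100, k=5, rng=np.random.default_rng(0)):
    """X: (samples, features) scaled to [0, 1]; y: class labels. Returns W."""
    n_samples, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / n_samples))
    W = np.zeros(n_features)                               # line 1: initialize weights
    for _ in range(m):                                     # repeat m times
        r = rng.integers(n_samples)                        # random target sample R
        R, cR = X[r], y[r]
        dists = np.abs(X - R).sum(axis=1)                  # Manhattan distances to R
        dists[r] = np.inf                                  # exclude R itself
        same = np.where(y == cR)[0]
        hits = same[np.argsort(dists[same])[:k]]           # k Near Hits
        W -= np.abs(X[hits] - R).sum(axis=0) / (m * k)     # hit term of Eq. (3)
        for c in classes[classes != cR]:
            other = np.where(y == c)[0]
            misses = other[np.argsort(dists[other])[:k]]   # k Near Misses of class c
            coef = priors[c] / (1.0 - priors[cR])
            W += coef * np.abs(X[misses] - R).sum(axis=0) / (m * k)
    return W
```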

IterRelCen. In the Relief (Relieff) algorithm, the weight computation is affected by the
selection of the target sample and the selection of its nearest neighbors. Both aspects
can introduce errors into the weight computation. Specifically: (1) the target sample
is randomly chosen from the training dataset, a strategy that gives noisy samples
(particularly samples far from the center of their class) a chance to be selected,
and the use of noisy samples easily biases the weight computation; (2) the selection
of nearest neighbors depends on the distances from the target sample, and the distance
computation is determined by the features that participate in it. In our EEG datasets,
the samples have high-dimensional features that may be mixed with noisy or redundant
features. Such features interfere with the distance computation between samples and
cause errors in the feature weights. To solve these two problems, we propose an
enhanced method, “IterRelCen”, based on the principle of Relief (Relieff), given in
Fig. 2b. IterRelCen modifies the Relief (Relieff) algorithm in two ways.
First, the target sample selection rule is adjusted: instead of being selected at
random, samples close to the center of their class are given priority (lines 5 and 6)
according to formula (5), since such samples discriminate between classes more reliably.
Second, the idea of iterative computation is borrowed to eliminate noisy features
(lines 1 and 14). In each iteration, the N features with the smallest weights
(N is a user-defined parameter that controls the iteration speed) are removed from
the current feature set (line 12), and the remaining features are fed into the
classifier for accuracy calculation (line 13). This process is repeated until the
current feature set is empty; a sketch of the loop is given after formula (5) below.
The features removed first are usually those that perform worst at discrimination,
and as noisy features are gradually removed, the distances between samples computed
from the remaining features reflect the relationships between samples more and more
accurately.

$$D(R_i) = \left\lVert R_i - \frac{1}{\lvert T_c \rvert}\sum_{R_j \in T_c} R_j \right\rVert, \qquad R_i \in T_c \tag{5}$$

where $T_c$ is the set of training samples of class $c$; samples with smaller $D(R_i)$, i.e., closer to the center of their class, are selected first.
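
A minimal sketch of the IterRelCen loop follows, reusing the relieff_weights() sketch above; evaluate() is a hypothetical placeholder for the classifier-accuracy step of Fig. 2c, and a full implementation would also feed the center-first ordering of formula (5) into the target-selection step instead of choosing targets at random.

```python
# Sketch of IterRelCen: center-first sample ordering plus iterative removal
# of the N lowest-weighted features, keeping the best-performing subset.
import numpy as np

def center_first_order(X, y):
    """Formula (5): rank samples within each class by distance to the class mean."""
    order = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        center = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - center, axis=1)
        order.extend(idx[np.argsort(d)])          # closest-to-center first
    return np.array(order)

def iter_rel_cen(X, y, evaluate, N=13):
    """Iteratively drop the N smallest-weight features; return the best subset."""
    active = np.arange(X.shape[1])                # current feature set (column indices)
    best_acc, best_subset = -np.inf, active.copy()
    while active.size > 0:
        # Relieff pass on the remaining features (target selection simplified here).
        W = relieff_weights(X[:, active], y)
        acc = evaluate(X[:, active], y)           # classifier accuracy, as in Fig. 2c
        if acc > best_acc:
            best_acc, best_subset = acc, active.copy()
        active = np.delete(active, np.argsort(W)[:min(N, active.size)])
    return best_subset, best_acc
```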

Fig. 2. Pseudo code of Relieff and IterRelCen algorithms and flow chart of the whole method.
a Pseudo code of Relieff algorithm; b pseudo code of IterRelCen algorithm; c the flow chart of the whole method for channel selection

Classification algorithm

A support vector machine (SVM) was used as the classification algorithm in this study.
The classical SVM is a technique developed previously [38] to solve two-class
classification problems. The main idea of this typical SVM is to separate a two-class
dataset by finding the maximum geometric margin between the two classes. Given a
training set of instance-label pairs $(x_i, y_i),\ i = 1, \dots, l$, where
$x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, the following optimization problem
needs to be solved in order to build the SVM model:

$$\min_{w,\, b,\, \xi}\ \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0 \tag{6}$$

The slack variables $\xi_i$ are introduced in case a small part of the data is not
linearly separable. The margin is defined as $2/\lVert w \rVert$, so our goal is to
find the best trade-off between a low training error and a large margin. We also
adopted a kernel function $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ to map training
vectors into a higher-dimensional space. Thus, combined with the slack variables,
most nonlinear problems can be transformed into linear ones. Several kernel functions
are available, e.g. the linear, polynomial, or radial basis function (RBF) kernel;
here, the RBF kernel was used.

The classical SVM can only discriminate between two classes; however, one of our
datasets came from the four-class control paradigm. We therefore implemented the
“one-against-one” approach [39] to solve this multiclass classification problem. If
$k$ is the number of classes, then $k(k-1)/2$ classifiers need to be constructed,
and each classifier model is trained on data from two classes.
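
As an aside, scikit-learn's SVC with an RBF kernel implements this same one-against-one scheme internally; the hyperparameter values below are illustrative assumptions, not those tuned in the study.

```python
# One-against-one RBF-kernel SVM via scikit-learn; SVC always trains
# k(k-1)/2 pairwise classifiers internally for multiclass problems.
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
# For the four-class paradigm, k = 4, so 4 * 3 / 2 = 6 pairwise models are fit
# when clf.fit(X_train, y_train) is called; prediction takes a majority vote.
```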

As shown in Fig. 2c, the channel selection process binds the channel selection method
(IterRelCen) together with the classification algorithm. The generalization accuracies
were estimated by ten-fold cross-validation (10-fold CV) [40], in which the whole
dataset is split into ten folds. In each fold, feature selection using IterRelCen is
first performed on the training set, resulting in a specific ranking of all features.
The selected features are then used to train the SVM model. Finally, the SVM is tested
on the corresponding features of the test set to evaluate the testing accuracy. The
classification accuracy is the average of the testing accuracies over the folds, given
by the following equation:

$$Accuracy = \frac{1}{k}\sum_{i=1}^{k} acc_i \tag{7}$$

where $k$ is 10, and $acc_i$ is the testing accuracy of one fold.
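
A minimal sketch of this evaluation loop, with select_features() standing in for the IterRelCen step (it must be fit on each training fold only, as described above):

```python
# 10-fold CV with fold-wise feature selection, averaging accuracies per Eq. (7).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validated_accuracy(X, y, select_features, k=10):
    """Mean testing accuracy over k folds, as in Eq. (7)."""
    accs = []
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        cols = select_features(X[train_idx], y[train_idx])   # fold-specific subset
        clf = SVC(kernel="rbf").fit(X[train_idx][:, cols], y[train_idx])
        accs.append(clf.score(X[test_idx][:, cols], y[test_idx]))
    return float(np.mean(accs))
```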

Through the above method, the feature subset that achieves the highest accuracy is
recognized as the optimal feature set, and the channels holding at least one selected
feature are considered the optimal channels. Note that with this method some optimal
channels may retain only a few feature bands (fewer than 13), while others may retain
all 13 feature bands.