Supporting systematic reviews using LDA-based document representations


We performed our experiments using five datasets corresponding to completed reviews
in the domains of social science and clinical trials. These reviews constitute the “gold
standard” data, in that for each domain, they include expert judgements about which
documents are relevant or irrelevant to the study in question. The datasets were used
as the basis for the intrinsic evaluation of the different text classification methods.
Our conclusions are supported by the Friedman test (Table 3), a nonparametric test that uses rankings to assess whether three or more matched or
paired groups differ. Given that the methods we applied produced roughly
comparable patterns of performance across each of the five different datasets, we
report here only on the results for one of the corpora. However, the specific results
achieved for the other corpora are included as supplementary material (Additional
file 1).
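
As an illustration, the Friedman test can be run over matched per-dataset scores with a few lines of Python; the sketch below uses scipy.stats.friedmanchisquare with purely illustrative numbers, not the values behind Table 3.

```python
# Hedged sketch: Friedman test over matched per-dataset scores with scipy.
# The numbers are illustrative only, not the values reported in Table 3.
from scipy.stats import friedmanchisquare

# One list per method; one entry per dataset (five matched datasets).
linear_f1 = [0.62, 0.58, 0.65, 0.60, 0.59]
rbf_f1    = [0.41, 0.39, 0.44, 0.40, 0.38]
poly_f1   = [0.43, 0.40, 0.45, 0.42, 0.39]

stat, p = friedmanchisquare(linear_f1, rbf_f1, poly_f1)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```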

Table 3. Friedman test results for the five datasets across different kernel functions and document representations

Dataset

We applied the models to three datasets provided by the Evidence Policy and Practice
Information and Coordinating Center (EPPI-Centre) [41] and two datasets previously presented in Wallace et al. [6]. These labelled corpora include reviews ranging from clinical trials to reviews in
the domain of social science. The datasets correspond specifically to cigarette packaging,
youth development, cooking skills, chronic obstructive pulmonary disease (COPD), proton
beam and hygiene behaviour. Each corpus contains a large number of documents and,
as mentioned above, there is an extremely low proportion of relevant documents in
each case. For example, the youth development corpus contains a total of 14,538 documents,
only 1440 of which are relevant to the study. Meanwhile, the cigarette packaging subset
contains 3156 documents in total, with 132 having been marked as relevant. Documents
in the datasets were first prepared for automatic classification using a series
of pre-processing steps: stop-word removal, conversion of words to lower
case, and removal of punctuation, digits and words that appear only once. Finally,
word counts were computed and saved in a tab-delimited format (SVMlight format), for
subsequent utilisation by the SVM classifiers. Meanwhile, TerMine was used to identify
multi-word terms in each document, as the basis for characterising their content.
Preliminary experiments indicated that using only multi-word terms to characterise
documents may not be sufficient, since in certain documents the number of such terms
could be small or zero. Accordingly, both single words and multi-word terms were retained as features for
an independent experiment.
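
The following is a minimal sketch of the pre-processing pipeline described above, written in plain Python; the stop-word list is a stub, treating “words that appear only once” as corpus-level frequency is our assumption, and the authors’ exact tooling is not specified.

```python
# Minimal sketch of the pre-processing pipeline: stop-word removal,
# lower-casing, stripping punctuation/digits, dropping hapax words,
# then writing word counts in SVMlight format.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # stub list

def tokenise(text):
    # Lower-case, keep letters only (drops punctuation and digits),
    # then remove stop words.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def to_svmlight(docs, labels):
    tokenised = [tokenise(d) for d in docs]
    freq = Counter(w for doc in tokenised for w in doc)
    # Keep only words appearing more than once; SVMlight ids are 1-based.
    vocab = {w: i + 1 for i, w in
             enumerate(sorted(w for w in freq if freq[w] > 1))}
    lines = []
    for doc, label in zip(tokenised, labels):
        counts = Counter(w for w in doc if w in vocab)
        feats = " ".join(f"{fid}:{c}" for fid, c in
                         sorted((vocab[w], c) for w, c in counts.items()))
        lines.append(f"{label} {feats}".rstrip())
    return "\n".join(lines)

print(to_svmlight(["Hygiene behaviour trial of hygiene interventions",
                   "A second trial, quite different"], [1, -1]))
```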

BOW-based classification

Table 4 shows the performance of the SVM classifiers trained with TF-IDF features when applied
to all corpora. Due to the imbalance between relevant and irrelevant instances in
the datasets, each positive instance was assigned a weight, as mentioned above. Default
values for SVM training parameters were used (i.e. no parameter tuning was carried
out), although three different types of kernel functions were investigated, i.e. linear,
radial basis function (RBF) and polynomial (POLY). Unlike the linear kernel, which aims
to find a single separating hyperplane between positive and negative instances, the RBF and POLY
kernels can capture more complex distinctions between classes. As illustrated
in Fig. 2, the BOW-based classification achieves the best performance when the linear kernel
function is used. However, it is necessary to recall that the ratio of positively
(i.e. relevant) to negatively (i.e. irrelevant) labelled instances is approximately
1:9 in our corpora. Hence, even if a classifier labels all test samples as irrelevant,
a very high accuracy will still be obtained. However, for systematic reviews,
it is most important to retrieve the highest possible number of relevant documents;
recall is therefore a much better indicator of performance than accuracy. Secondly, both the
RBF and polynomial kernel functions obtained zero for precision, recall and F1-score.
This can be attributed to the imbalanced nature of the corpora [42]. Additionally, the BOW representation produces a high-dimensional feature space, given the
large number of unique words in the corpora. In this high-dimensional space, the
two non-linear kernels (RBF and POLY) yield very low performance.
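
For concreteness, the set-up described above could be reproduced along the following lines, using scikit-learn as a stand-in for the SVMlight-based tooling; train_texts, train_labels, test_texts and test_labels are assumed to exist, and the positive-class weight of 9 is an illustrative value reflecting the ~1:9 imbalance, not the authors’ exact setting.

```python
# Illustrative sketch: TF-IDF features, default SVM parameters, a weight on
# the positive (relevant) class, and the three kernels compared in Table 4.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train_texts)
X_test = vectoriser.transform(test_texts)

for kernel in ("linear", "rbf", "poly"):
    clf = SVC(kernel=kernel, class_weight={1: 9, 0: 1})  # up-weight relevant docs
    clf.fit(X_train, train_labels)
    p, r, f, _ = precision_recall_fscore_support(
        test_labels, clf.predict(X_test), average="binary")
    print(f"{kernel}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```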

Table 4. Evaluation on all corpora of SVM classifiers trained with TF-IDF features

Fig. 2. Linear kernel function. Comparison between the performance of BOW-based, topic distribution-based
and term-enriched topic classifiers trained using a linear kernel function

Topic-based classification

Topic-based classification was undertaken by firstly analysing and predicting the
topic distribution for each document and then classifying the documents using topics
as features. During model training, the topic assigned to each word in a document
can be treated as a hidden variable; this inference problem can be solved by
using approximation methods such as Markov chain Monte Carlo (MCMC) or variational
inference. However, these methods are sensitive to initial parameter settings, which
are usually set randomly before the first iteration. Consequently, the results can
fluctuate within a certain range, and the results reported for topic-based classification
are therefore averages. Our results show that the topic distribution is an effective
replacement for traditional BOW features, its most obvious advantage being the
reduced dimensionality of the feature space used to represent a document.
Experimental settings were identical in the evaluation of the two sets of classifiers,
except for the features being topic distributions in one case and BOW in the other.
The optimal LDA model was derived by experimenting with different numbers of
topics (also referred to as the “topic density”); several values for this parameter were explored.
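
A minimal sketch of this pipeline is given below, using gensim’s LdaModel as a stand-in for the authors’ unspecified LDA implementation; tokenised_docs is assumed to hold the pre-processed documents, and 150 topics follows the best-performing topic density reported for the youth development corpus.

```python
# Minimal sketch: derive per-document topic distributions as SVM features.
import numpy as np
from gensim import corpora, models

dictionary = corpora.Dictionary(tokenised_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

# 150 topics: best-performing "topic density" on the youth development corpus.
lda = models.LdaModel(bow_corpus, num_topics=150, id2word=dictionary)

def topic_features(doc_bow):
    # One dimension per topic instead of one per vocabulary word.
    dist = lda.get_document_topics(doc_bow, minimum_probability=0.0)
    return np.array([prob for _, prob in dist])

X = np.vstack([topic_features(b) for b in bow_corpus])  # SVM feature matrix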

Table 5 shows the results of the evaluation of SVM models trained with topic distribution
features using linear, RBF and POLY kernel functions, respectively. We show how the
performance varies according to different topic density values for the LDA model.
These values were varied from 2 to 100 in increments of approximately 10, and from
100 to 500 in increments of 100. Generally, each topic density corresponds to a certain size of corpus and vocabulary. Empirically, the larger the
correspond to a certain size of corpus and vocabulary. Empirically, the larger the
size of the corpora and vocabulary, the greater the number of topics that is needed
to accurately represent their contents, and vice versa. Tables 6 and 7 show two sample sets of words and/or terms, each representative of a topic
in the same corpus (youth development). Term-enriched (TE) topics include multi-word
terms identified by TerMine as well as single words, whilst ordinary topics consist
only of single words. From the tables, it can be clearly seen that term-enriched topics
are more distinctive and readable than single-word topics. As the classification performance
of TE-topic-based models was similar to that of single-word topic-based models, a table analogous to Table 5 is not presented here. However, a comparison of the classification performance
for the three approaches, i.e. BOW-based, topic-based and TE-topic-based will be presented
in the next section.
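
How multi-word terms are folded into the topic model is not detailed in the text; one plausible mechanism, sketched below, is to merge each TerMine-identified term into a single token so that LDA treats it as an atomic vocabulary item. The term list here is a stub standing in for TerMine output.

```python
# Illustrative term-enrichment sketch: merge multi-word terms into single
# tokens before topic modelling. The merging strategy is an assumption.
MULTI_WORD_TERMS = ["youth development", "young people"]  # stub TerMine output

def enrich(tokens, terms=MULTI_WORD_TERMS):
    text = " ".join(tokens)
    for term in terms:
        text = text.replace(term, term.replace(" ", "_"))
    return text.split()

print(enrich(["positive", "youth", "development", "programme"]))
# ['positive', 'youth_development', 'programme']
```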

Table 5. Evaluation on the youth development data set of SVM classifiers trained with topic
features

Table 6. Term-enriched topics

Table 7. Ordinary topics

Comparison of approaches

This section compares the performance of the BOW-based model (BOW in the legend) against that
of models trained with topic distribution features (TPC) and term-enriched topic
features (TE). According to the results obtained with a linear kernel
function (Fig. 2), models based on topic and TE-topic distribution features yield lower precision,
F-score, ROC and PRC but obtain higher recall. For this comparison, the best performing
topic-based model (with topic density set to 150 for the youth development corpus) was
used. It can be observed from Fig. 2 that the BOW-based model outperforms the topic- and TE-topic based one in terms of
all metrics except for recall. Figures 3 and 4 illustrate the results of using RBF and POLY kernel functions, respectively, to train
BOW-based, topic-based and TE-topic-based models on the youth development corpus.
It can be observed that, when these kernels are employed, the SVM models trained with topic
and TE-topic distributions outperform those trained with BOW features by a large margin.
Another observation is that training with the RBF and POLY kernel functions significantly
degraded the performance of the BOW-based models, which perform poorly under these
kernels, with zero precision, recall and F-score. As noted earlier, high accuracy is not a good basis for judging performance
due to the imbalance between positive and negative instances, i.e. even if a classifier
labels every document as a negative sample, accuracy will still be around 90 %. Figure
5 compares the different kernel functions using topic features on the youth
development corpus, indicating that, taking all measures into account, the linear kernel
function gave the best overall performance, achieving the highest score in every metric
other than recall. However, both the RBF and POLY kernel functions outperformed the linear kernel
on recall, albeit by only 4 %; we have identified recall as highly pertinent
to the systematic review use case. Table 8 ranks the kernel functions from high to low in terms of recall for topic-based and TE-topic-based features: POLY > RBF > LINEAR. The corresponding ranking of feature types in terms of recall is TPC > TE > BOW. Additionally, Figs. 6 and 7 show the ROC and precision-recall curves achieved by the models.
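
Curves such as those in Figs. 6 and 7 can be generated from the SVM decision scores; the sketch below shows one way to do so with scikit-learn, reusing the hypothetical clf, X_test and test_labels from the earlier BOW sketch.

```python
# Hedged sketch: ROC and precision-recall curves from SVM decision scores.
from sklearn.metrics import roc_curve, precision_recall_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, _ = roc_curve(test_labels, scores)
precision, recall, _ = precision_recall_curve(test_labels, scores)
print(f"ROC AUC = {auc(fpr, tpr):.3f}")
```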

Fig. 3. RBF kernel function. Comparison between the performance of BOW-based, topic distribution-based
and term-enriched topic classifiers trained using an RBF kernel function

Fig. 4. POLY kernel function. Comparison between the performance of BOW-based, topic distribution-based
and term-enriched topic classifiers trained using a POLY kernel function

Fig. 5. Different kernel functions. Comparison between the performance of linear, RBF and
POLY kernel functions using topic feature

Table 8. Performance on all corpora with different feature types and kernel functions

Fig. 6. Receiver operating characteristic (ROC) curves: each panel was produced using a different kernel function. Left: linear kernel function. Middle: RBF kernel function. Right: POLY kernel function

Fig. 7. Precision-recall curves: each panel was produced using a different kernel function. Left: linear kernel function. Middle: RBF kernel function. Right: POLY kernel function