Injury narrative text classification using factorization model

The medical injury narrative dataset is composed of many short documents. Each
document describes an injury event: how it happened, where it happened and what
caused it. The data are high-dimensional and sparse; few features are irrelevant,
but features tend to be correlated with one another and are generally organized
into linearly separable categories. Because of this characteristic, it is more
appropriate to use matrix factorization techniques, which map all the features
into a lower-dimensional space, than to reduce the number of features. Feature
reduction methods such as Mutual Information, the Chi-Squared criterion and the
odds ratio aim to reduce the number of features. However, no single feature
selection method works well with all text categorization methods, as different
classifiers weigh the importance of features differently. Simply removing features
may discard features that are important for some classifiers. Feature extraction
methods such as matrix factorization, on the other hand, should work well because
all the features are kept and transformed into a lower-dimensional space. By doing
so, noise is reduced and the latent semantic structure of the vocabulary used in
the dataset becomes more apparent; the lower-dimensional feature space thus
improves classification performance. We propose to use two matrix factorization
techniques: Singular Value Decomposition (SVD) and Non-negative Matrix
Factorization (NNMF).
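To make the "high-dimensional sparse" characterization concrete, the following is a minimal sketch (not the paper's pipeline) of how short injury narratives become a sparse document-term matrix; the example narratives and the use of scikit-learn's TfidfVectorizer are our own illustrative assumptions.

```python
# Build a TF-IDF document-term matrix from a few invented injury narratives
# and inspect its dimensionality and sparsity.
from sklearn.feature_extraction.text import TfidfVectorizer

narratives = [
    "worker fell from ladder and fractured wrist",
    "employee slipped on wet floor and sprained ankle",
    "hand caught in press machine causing laceration",
    "fell on stairs while carrying boxes and bruised knee",
]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(narratives)   # sparse m x n matrix

m, n = D.shape
density = D.nnz / (m * n)                  # fraction of non-zero entries
print(m, n, round(density, 2))
```

Even with four documents, most entries of D are zero; on a real corpus with thousands of narratives and terms, the sparsity is far more pronounced, which is what motivates mapping D into a dense lower-dimensional space.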

Singular value decomposition (SVD)

Let D denote the narrative text injury dataset. Singular Value Decomposition transforms
the original matrix D ∈ R^{m×n} into a left singular matrix U ∈ R^{m×m}, a right singular
matrix V ∈ R^{n×n} and a diagonal matrix Σ ∈ R^{m×n}. Formally, D = UΣV^T. Let Dk denote
the approximation formed by keeping only the top k singular values in Σ; among all rank-k
matrices, Dk has the smallest distance to D measured in any unitarily invariant norm [12].
More specifically, Dk can be decomposed as Dk = Uk Σk Vk^T,

where Σk ∈ R^{k×k} is the diagonal matrix of the k largest singular values, Uk ∈ R^{m×k}, and Vk ∈ R^{n×k}.
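The truncation above can be sketched in a few lines of numpy; the random matrix here is a stand-in for the real document-term data, and the final check verifies the optimality property cited from [12] (the Frobenius error equals the energy in the discarded singular values).

```python
# Truncated SVD: Dk = Uk @ Sk @ Vk^T is the best rank-k approximation of D.
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((6, 8))                  # stand-in for the m x n matrix
k = 3

U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_k = U[:, :k]                          # m x k left singular vectors
S_k = np.diag(s[:k])                    # k x k top singular values
Vt_k = Vt[:k, :]                        # k x n right singular vectors

D_k = U_k @ S_k @ Vt_k                  # rank-k approximation
err = np.linalg.norm(D - D_k)           # Frobenius distance ||D - Dk||

# Eckart-Young: the error equals the norm of the dropped singular values.
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))
```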

Model-based Classification: Let D be the narrative text injury dataset containing m documents, with SVD approximation Dk = Uk Σk Vk^T. Each document di in D can then be represented in terms of the k largest singular values as the i-th row of Uk Σk.

The classification problem is to approximate a target function f : Dk → C, where C = {c1, c2, …, cx} is a set of pre-defined categories. In this paper, various families of classifiers
are tested in the experiment section, with only the best classifier being applied
to the approximated document matrix Dk.
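The model-based route can be sketched as a standard scikit-learn pipeline: factorize, then classify in the reduced space. The toy narratives, the k = 2 choice, and the use of logistic regression (standing in for whichever classifier proves best in the experiments) are all illustrative assumptions.

```python
# Model-based classification: TF-IDF -> truncated SVD -> classifier.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "fell from ladder broke arm",
    "fell down stairs hurt back",
    "slipped on ice fractured hip",
    "cut hand on saw blade",
    "laceration from box cutter",
    "knife cut finger in kitchen",
]
labels = ["fall", "fall", "fall", "cut", "cut", "cut"]

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # map to k latent dims
    LogisticRegression(),                          # classifier on Uk*Sk rows
)
model.fit(docs, labels)
print(model.predict(["worker fell off ladder"]))
```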

Memory-based Classification: Inspired by the CMF method [13], the approximation matrix can be projected as

Dk Vk = Uk Σk Vk^T Vk = Uk Σk (4)

because the columns of Vk are orthonormal, so Vk^T Vk = I, where I is the identity matrix.

Let the training dataset be denoted as DTrain and let the testing dataset be denoted as DTest. The singular vectors extracted in the training phase also represent the test data, as training
data and test data should exhibit similar characteristics [13]:

D̂Train = DTrain Vk (5)

D̂Test = DTest Vk (6)

By comparing the similarity of each document in D̂Test with each document in D̂Train, the most likely class is assigned: each document in D̂Test receives the class label of the most similar document in D̂Train. Formally,

c(d̂test) = c(argmax_{d̂train ∈ D̂Train} sim(d̂test, d̂train)) (7)
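The memory-based scheme of equations (5)–(7) can be sketched with numpy: learn Vk from the training matrix, project both splits through it, then assign each test document the label of its nearest training neighbor by cosine similarity. The random matrices and labels stand in for real data.

```python
# Memory-based SVD classification: project with Vk, then 1-nearest-neighbor.
import numpy as np

rng = np.random.default_rng(1)
D_train = rng.random((8, 12))           # training term matrix (m x n)
D_test = rng.random((3, 12))            # test term matrix
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
k = 4

_, _, Vt = np.linalg.svd(D_train, full_matrices=False)
V_k = Vt[:k, :].T                       # n x k right singular vectors

P_train = D_train @ V_k                 # reduced training documents, eq. (5)
P_test = D_test @ V_k                   # reduced test documents, eq. (6)

def cos_sim(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Eq. (7): each test doc takes the label of its most similar training doc.
pred = labels[cos_sim(P_test, P_train).argmax(axis=1)]
print(pred)
```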

Non-negative matrix factorization (NNMF)

It is more natural to use NNMF rather than SVD because each document is an additive
combination of topics, so the coefficients should all be non-negative. Further,
the topics within a document are not completely independent of each other, so the
semantic space capturing the topics is not necessarily orthogonal [24].

Given D ∈ R^{m×n}, it can be decomposed into A ∈ R^{m×r} and H ∈ R^{r×n}, where r is the dimensionality of the new feature space [25]. Formally,

D ≈ AH (8)

The approximation in equation (8) can be solved by minimizing either the Kullback–Leibler (KL)
divergence or the Frobenius norm [25]. A is interpreted as the representation of documents in the
newly formed space, and H as the representation of terms in that space.
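The factorization D ≈ AH can be sketched with scikit-learn's NMF, which minimizes the Frobenius objective by default (KL divergence is available via `beta_loss="kullback-leibler"` with the `"mu"` solver); the random non-negative matrix below is a stand-in for a real term matrix.

```python
# Non-negative matrix factorization: D (m x n) ~= A (m x r) @ H (r x n).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
D = rng.random((6, 10))                 # non-negative m x n matrix
r = 3                                   # dimensionality of the new space

nmf = NMF(n_components=r, init="nndsvda", random_state=0, max_iter=500)
A = nmf.fit_transform(D)                # document representation (m x r)
H = nmf.components_                     # term representation (r x n)

print(A.shape, H.shape)
```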

Model-based classification: Because A is deemed to be the representation of documents in the newly formed feature space,
A is used for classification. Let ATrain be the document representation derived from DTrain, and let ATest be the document representation derived from DTest. During the training phase, the best classifier is trained on (ATrain, CTrain), where CTrain contains the class label for each training case. During the testing phase, ATest is supplied to the trained classifier to predict the class label of each
testing case.
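A sketch of this model-based route, assuming scikit-learn and invented toy narratives: fit NMF on the training documents, project both splits into the same factor space, and train a classifier (logistic regression here, standing in for the best classifier from the experiments) on (ATrain, CTrain).

```python
# Model-based NNMF classification: NMF factors as classifier features.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = [
    "fell from ladder broke arm",
    "fell down stairs hurt back",
    "cut hand on saw blade",
    "laceration from box cutter",
]
train_labels = ["fall", "fall", "cut", "cut"]
test_docs = ["slipped and fell off roof"]

vec = TfidfVectorizer()
D_train = vec.fit_transform(train_docs)
D_test = vec.transform(test_docs)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
A_train = nmf.fit_transform(D_train)    # ATrain: training document factors
A_test = nmf.transform(D_test)          # ATest: test docs under the same H

clf = LogisticRegression().fit(A_train, train_labels)
print(clf.predict(A_test))
```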

Memory-based classification: Let the training dataset be denoted as DTrain and let the testing dataset be denoted as DTest. Each is factorized as

DTrain ≈ ATrain H (9)

DTest ≈ ATest H (10)

where H is held fixed from the training factorization.

By comparing the similarity of each document in ATest with each document in ATrain, the most likely class is assigned: each document in ATest receives the class label of the most similar document in ATrain. More specifically,

c(atest) = c(argmax_{atrain ∈ ATrain} sim(atest, atrain)) (11)
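The memory-based NNMF step of equations (9)–(11) can be sketched much like its SVD counterpart: factorize the training matrix, solve for the test factors with H fixed, then match by cosine similarity. The random matrices and labels are illustrative stand-ins.

```python
# Memory-based NNMF classification: nearest neighbor in the A-space.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(2)
D_train = rng.random((6, 9))            # non-negative training matrix
D_test = rng.random((2, 9))             # non-negative test matrix
labels = np.array(["fall", "fall", "fall", "cut", "cut", "cut"])

nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
A_train = nmf.fit_transform(D_train)    # eq. (9): DTrain ~= ATrain H
A_test = nmf.transform(D_test)          # eq. (10): solve ATest with H fixed

# Eq. (11): each test doc takes the label of its most similar training doc.
sim = cosine_similarity(A_test, A_train)
pred = labels[sim.argmax(axis=1)]
print(pred)
```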