Development and validation of a classification approach for extracting severity automatically from electronic health records

Assessment of phenotype severity

Severe phenotypes in general are more prevalent in EHRs because in-patient records
contain “sicker” individuals when compared to the general population, which can introduce
Berkson bias [36]. However, in the general population mild phenotypes are often more prevalent than
severe phenotypes.

For condition/phenotype information we used data from CUMC EHRs, which was initially
recorded using ICD-9 codes. These ICD-9 codes were mapped to SNOMED-CT codes using
the OMOP CDM v.4 [2]. For this paper, we used all phenotypes (each phenotype being a unique SNOMED-CT
code) with prevalence of at least 0.0001 in our hospital database. This constituted
4,683 phenotypes. We then analyzed the distribution of each of the five measures and
E-PSI among the 4,683 phenotypes. Figure 2 shows the correlation matrix among the 5 severity measures and E-PSI.

Figure 2. Severity measure correlation matrix. Histograms of each severity measure shown (along the diagonal) with pairwise correlation
graphs (lower triangle) and correlation coefficients and p-values (upper triangle).
Notice the condition length is the least correlated with the other measures while
number of medications and number of procedures are highly correlated (r = 0.88, p < 0.001).

Strong correlations exist between the number of procedures and both the number of
medications (r = 0.88) and the number of comorbidities (r = 0.89). This indicates
that there is a high degree of inter-relatedness between the number of procedures
and the other severity measures. Cost was calculated using HCPCS codes alone, whereas
the number of procedures measure includes both HCPCS and the ICD-9 procedure codes
as defined in the OMOP CDM. Because cost was calculated using only HCPCS codes, the
correlation between cost and the number of procedures was only 0.63. In addition, the severity
measures were higher for more severe phenotypes. This could be useful for distinguishing
among subtypes of a given phenotype based on severity.
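
As an illustration of how a pairwise coefficient like those reported above could be computed, the following minimal Python sketch calculates a Pearson correlation between two severity measures. The text does not state which correlation coefficient was used, so Pearson is an assumption here; the function name and the input values are hypothetical and are not data from the CUMC EHR.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-phenotype averages (illustrative only, not from the paper).
n_procedures = [1.0, 2.0, 3.0, 4.0, 5.0]
n_medications = [2.0, 4.1, 5.9, 8.2, 9.8]
r = pearson_r(n_procedures, n_medications)
print(f"r = {r:.2f}")
```

Applying such a function to every pair of the six measures would yield the upper triangle of a correlation matrix like the one in Figure 2.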

E-PSI versus other severity measures

We performed ICA on a data frame containing each of the five severity measures and
E-PSI. The result is shown in Figure 3 with phenotypes colored by increasing E-PSI score and size denoting cost. Notice
that phenotype cost is not directly related to the E-PSI score. Also phenotypes with
higher E-PSI seem to be more severe (Figure 3). For example, ‘complication of transplanted heart’, a severe phenotype, had a high
E-PSI score (and high cost).

Figure 3. Independent component analysis of phenotypes illustrates relationship between E-PSI
and cost.
Independent Component Analysis was performed using all five severity measures and
E-PSI. Phenotypes are colored by increasing E-PSI score (higher score denoted by light
blue, lower score denoted by dark navy). The size indicates cost (large size indicates
high cost). Phenotypes with higher E-PSI seem to be more severe; for example, ‘complication
of transplanted heart’, a severe phenotype, had a high E-PSI score (and high cost).
However, phenotype cost is not directly related to the E-PSI score.

Phenotypes can be ranked differently depending on the severity measure used. To illustrate
this, we ranked the phenotypes using E-PSI, cost, and treatment length and extracted
the top 10 given in Table 1. When ranked by E-PSI and cost, transplant complication phenotypes appeared (4/10
phenotypes), which are generally considered to be highly severe. However, the top
10 phenotypes when ranked by treatment time were also highly severe phenotypes, e.g.,
Human Immunodeficiency Virus and sickle cell. An ideal approach, used in CAESAR, combines
multiple severity measures into one classifier.

Table 1. Top 10 phenotypes ranked by severity measure

‘Complication of transplanted heart’ appears in the top 10 phenotypes when ranked
by all three severity measures (italicized in Table 1). This is particularly interesting because this phenotype is both a complication
phenotype and transplant phenotype. By being a complication the phenotype is therefore
a severe subtype of another phenotype, in this case a heart transplant (which is actually
a procedure). Heart transplants are only performed on sick patients; therefore this
phenotype is always a subtype of another phenotype (e.g., coronary arteriosclerosis).
Hence ‘complication of transplanted heart’ is a severe subtype of multiple phenotypes
(e.g., heart transplant, and the precursor phenotype that necessitated the heart transplant
– coronary arteriosclerosis).

Evaluation of severity measures

Development of the Reference Standard severe and mild SNOMED-CT codes involved using a set of heuristics with medical guidance.
Phenotypes were considered severe if they were life threatening (e.g., ‘stroke’) or
permanently disabling (e.g., ‘spina bifida’). In general, congenital phenotypes were
considered severe unless easily correctable. Phenotypes were considered mild if they
generally require routine or non-surgical treatment (e.g., ‘throat soreness’).

Several heuristics were used: 1) all benign neoplasms were labeled as mild; 2) all
malignant neoplasms were labeled as severe; 3) all ulcers were labeled as mild; 4)
common symptoms and conditions that are generally of a mild nature (e.g., ‘single
live birth’, ‘throat soreness’, ‘vomiting’) were labeled as mild; 5) phenotypes that
were known to be severe (e.g., ‘myocardial infarction’, ‘stroke’, ‘cerebral palsy’)
were labeled as severe. The final classification of phenotypes as severe or mild was left
to the ontology expert, who consulted with medical experts when deemed appropriate. The final reference
standard consisted of 516 SNOMED-CT phenotypes (of the 4,683 phenotypes). In the reference
standard, 372 phenotypes were labeled as mild and 144 were labeled as severe.

Evaluation of the Reference Standard was performed using volunteers from the Department of Biomedical Informatics at CUMC.
Seven volunteers evaluated the reference standard including three MDs with residency
training, three graduate students with informatics experience and one post-doc (non-MD).
Compensation was commensurate with experience (post-docs received $15 and graduate
students received $10 Starbucks gift cards).

We excluded two evaluations from our analyses: one because the evaluator had great
difficulty with the medical terminology, and the second because the evaluator failed
to use the drop-down menu provided as part of the evaluation. We calculated the Fleiss
kappa for inter-rater agreement among the remaining 5 evaluations and found evaluator
agreement was high (k = 0.716). The individual results for agreement between each
evaluator and the reference standard were kappa equal to 0.66, 0.68, 0.70, 0.74, and
0.80. Overall, evaluator agreement (k = 0.716) was sufficient for comparing two groups
(i.e., mild and severe) and 100% agreement was observed between all five raters and
the reference-standard for 77 phenotypes (of 100).
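
For readers unfamiliar with the statistic, the sketch below shows one way the Fleiss kappa can be computed from a table of rating counts. It is a minimal illustration rather than the code used in the study; the function name and the example counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss kappa for inter-rater agreement.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 4 phenotypes, 5 raters, categories (mild, severe).
example_counts = [[5, 0], [4, 1], [0, 5], [1, 4]]
print(f"kappa = {fleiss_kappa(example_counts):.3f}")
```

Each row corresponds to one phenotype in the evaluation set, and each column to one severity label.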

Evaluation of Measures at Capturing Severity was performed by comparing the distributions of all 6 measures between severe and
mild phenotypes in our 516-phenotype reference standard. Results are shown in Figure 4. Increases were observed for severe phenotypes across all measures. We performed
the Wilcoxon Rank Sum Test to assess significance of the differences between severe
vs. mild phenotypes shown in Figure 4. The p-values for each comparison were < 0.001.
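
The comparison described above can be sketched in Python as follows. This is a minimal two-sided Wilcoxon rank-sum test using the normal approximation with a tie correction and no continuity correction; the exact variant used in the study is not stated, so these choices are assumptions, and the function name and sample values are hypothetical.

```python
from math import erfc, sqrt


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (rank sum of sample a, z statistic, p-value).
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        raise ValueError("test is undefined when all values are identical")
    z = (rank_sum_a - mean_w) / sqrt(var_w)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return rank_sum_a, z, p_value


# Hypothetical medication counts for mild and severe phenotypes.
mild = [1, 2, 3, 4, 5]
severe = [6, 7, 8, 9, 10]
w, z, p = rank_sum_test(mild, severe)
print(f"W = {w}, z = {z:.3f}, p = {p:.4f}")
```

With samples as large as the 372 mild and 144 severe phenotypes in the reference standard, the normal approximation is generally considered adequate.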

Figure 4. Differences in severity measures and E-PSI for mild vs. severe phenotypes. The distribution of each of the 6 measures used in CAESAR is shown for severe and
mild phenotypes. Severity assignments were from our reference standard. Using the
Wilcoxon Rank Sum Test, we found statistically significant differences between severe
and mild phenotypes across all 6 measures (p < 0.001). Severe phenotypes (dark red)
have higher values for each of the six measures than mild phenotypes. The least
dramatic differences were observed for cost and number of comorbidities while the
most dramatic difference was for the number of medications.

Unsupervised learning of severity classes

Development of the random forest classifier

CAESAR used an unsupervised random forest algorithm (randomForest package in R) that
required E-PSI and all 5 severity measures as input. We ran CAESAR on all 4,683 phenotypes
and then used the 516-phenotype reference standard to measure the accuracy of the
classifier.

Evaluation of the random forest classifier

CAESAR achieved a sensitivity = 91.67% and specificity = 77.78%, indicating that it was
able to discriminate between severe and mild phenotypes. CAESAR was able to detect
mild phenotypes better than severe phenotypes as shown in Figure 5.
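
To make the two metrics concrete, the sketch below computes sensitivity and specificity from reference-standard labels and predicted labels. It is illustrative only: the function name and the labels are hypothetical, and because the text does not state which class was treated as the positive class, that choice is left as a parameter.

```python
def sensitivity_specificity(truth, predicted, positive):
    """Return (sensitivity, specificity) for a binary classification.

    truth and predicted are equal-length sequences of class labels;
    positive names the label treated as the positive class.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1  # positive correctly detected
            else:
                fn += 1  # positive missed
        else:
            if p == positive:
                fp += 1  # negative wrongly flagged as positive
            else:
                tn += 1  # negative correctly rejected
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in the reference labels")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for five phenotypes (not results from the paper).
reference = ["severe", "severe", "mild", "mild", "mild"]
predicted = ["severe", "mild", "mild", "mild", "severe"]
sens, spec = sensitivity_specificity(reference, predicted, positive="severe")
print(f"sensitivity = {sens:.2%}, specificity = {spec:.2%}")
```

Swapping the positive class exchanges the two values, which is why the choice of positive class matters when interpreting the figures reported above.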

Figure 5. CAESAR error rates. Error rates for CAESAR’s random forest classifier are depicted with severe denoted
by the green line, mild denoted by the red line and out-of-bag (OOB) error denoted
by the black line. CAESAR achieved a sensitivity = 91.67% and specificity = 77.78%, indicating
that it was able to discriminate between severe and mild phenotypes. CAESAR was able
to detect mild phenotypes better than severe phenotypes.

The Mean Decrease in Gini (MDG) measured the importance of each severity measure in
CAESAR. The most important measure was the number of medications (MDG = 54.83) followed
by E-PSI (MDG = 40.40) and the number of comorbidities (MDG = 30.92). Cost was the
least important measure (MDG = 24.35).
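
The idea behind the Mean Decrease in Gini can be illustrated with a single split. In a random forest, the decrease in Gini impurity is accumulated over every split made on a given variable and averaged over all trees; the sketch below shows only the per-split quantity. The function names and labels are hypothetical, and the exact scaling used by the randomForest package may differ from this proportion-weighted form.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when parent is split into left and right children."""
    n = len(parent)
    if n == 0 or len(left) + len(right) != n:
        raise ValueError("left and right must partition a non-empty parent")
    weighted_children = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(parent) - weighted_children


# Hypothetical node: a split on one measure separates the classes perfectly.
parent = ["mild", "mild", "severe", "severe"]
left = ["mild", "mild"]
right = ["severe", "severe"]
print(f"decrease = {gini_decrease(parent, left, right):.2f}")
```

A measure whose splits repeatedly produce large decreases, as the number of medications does in CAESAR, ends up with a high MDG.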

Figure 6 shows all 4,683 phenotypes used by CAESAR plotted on the scaled 1-proximity for each phenotype
[34], with the reference standard overlaid on top. Notice that phenotypes cluster by severity
class (i.e., mild or severe) with a “mild” space (lower left) and a “severe” space
(lower right), and phenotypes of intermediate severity in between.

Figure 6. Classification result from CAESAR showing all 4,683 phenotypes (gray) with severe
(red) and mild (pink) phenotype labels from the reference standard.
All 4,683 phenotypes plotted using CAESAR’s dimensions 1 and 2 of the scaled 1-proximity
matrix. Severe phenotypes are colored red, mild phenotypes are colored pink and phenotypes
not in the reference standard are colored gray. Notice that most of the severe phenotypes
are in the lower right hand portion of the plot while the “mild” space is found in
the lower left hand portion.

However, three phenotypes are in the “mild” space (lower left) of the random forest
model (Figure 6). These phenotypes are ‘allergy to peanuts’, ‘suicide-cut/stab’, and ‘motor vehicle
traffic accident involving collision between motor vehicle and animal-drawn vehicle,
driver of motor vehicle injured’. These phenotypes are probably misclassified because
they are ambiguous (in the case of the motor vehicle accident, and the suicide cut/stab)
or because the severity information may be contained in unstructured EHR data elements
(as could be the case with allergies).

Using the proximity matrix also allows further discrimination among severity levels
beyond the binary mild vs. severe classification. Phenotypes with ambiguous severity
classifications appear in the middle of Figure 6. To identify highly severe phenotypes, we can focus only on phenotypes contained
in the lower right hand portion of Figure 6. This reduces the phenotype selection space from 4,683 to 1,395 phenotypes (~70%
reduction).

We are providing several CAESAR files for free download online at http://caesar.tatonettilab.org. These include the 516-phenotype reference standard used to evaluate CAESAR, the
100-phenotype evaluation set given to the independent evaluators along with the instructions,
and the 4,683 conditions with their E-PSI scores and the first and second dimensions
of the 1-proximity matrix (shown in Figure 6). This last file also contains two subset tables containing the automatically classified
“mild” and “severe” phenotypes and their scores.