Validating estimates of prevalence of non-communicable diseases based on household surveys: the symptomatic diagnosis study

Overview of study design

The SD study consisted of two components: data collection and model validation. The
data collection portion consisted of identifying cases of different NCDs in a hospital
and then administering a questionnaire to the patient at a later date. The data were
collected in Mexico as part of the Population Health Metrics Research Consortium (PHMRC)
Study [22]. The PHMRC Study is an offshoot of the Gates Grand Challenge 13 PHMRC Project, an
international collaboration focused on developing better ways to measure health. The
model validation component used these validation data to test different approaches
to the analytical question of interest.

The SD study developed a questionnaire that focused on 10 NCDs, namely angina pectoris
(ICD-10 I20.9), rheumatoid arthritis (ICD-10 M05-M06), cataracts (ICD-10 H25-H26,
H28, Q12.0), asthma (ICD-10 J45), COPD (ICD-10 J40-J44, J47), symptomatic cirrhosis
(ICD-10 K70.3, K71.7, K74), vision loss (ICD-10 H54), hearing loss (ICD-10 H90-H91),
depression (ICD-10 F32, F33), and osteoarthritis (ICD-10 M15-M19, M47). These causes
were chosen since they contribute considerably to the burden of disease in Mexico,
and because current methods for collecting prevalence data on these conditions are
expensive and time-consuming. This questionnaire also collected socio-demographic
information. The questionnaire was adapted from the World Health Survey [15] and the PHMRC Household Survey [24]. It collects information on the respondent's signs and symptoms, but also asks questions
about the respondent's experience, if any, with health
care providers (health care experience; HCE). These questions ask whether the
respondent has ever been diagnosed with different conditions, and whether certain
medical procedures or protocols have occurred. The list of items relevant to ascertain
the presence of a disease in the questionnaire and HCE indicators is provided in Additional
file 1.

Procedure

Data collection involved three stages: identifying cases for the 10 conditions of
interest, identifying controls who did not suffer from any of the NCDs, and then implementing
the SD questionnaire at the household of each case and each control.

Cases

A team of trained coders located approximately 1,200 cases (120 of each of the morbid
conditions under study), identifying the 120 depression cases at a psychiatric hospital
and the remaining cases in 11 public hospitals in the Mexico City area.

For each condition, a case was defined as a patient whom a physician had diagnosed
with the condition and who met a specific set of gold standard diagnostic criteria
decided on by the PHMRC team. A gold standard diagnosis refers to diagnosis of a
specific disease with the highest level of accuracy possible. This involves checking
that the diagnosis is based on positive results from a laboratory test or appropriate
diagnostic imaging, and/or that the relevant signs and symptoms of the disease were
documented in the clinical record. To be acceptable, the symptoms of the disease must
be observed or documented in a medical record by a physician. The gold standard
criteria for each condition are provided in Additional file 2.

We only included cases living in Mexico City who had an address that was identifiable
through the hospital records. Once the cases were identified, an interviewer visited
each household to administer an SD questionnaire to the cases.

Controls

We located a population of controls from the records of the Automated Detection and
Diagnosis Clinic (CLIDDA) in Mexico City. CLIDDA performs a battery of diagnostic
tests on people who are affiliated with the Instituto de Seguridad y Servicios Sociales
para los Trabajadores del Estado. We defined a control as someone who had attended
the CLIDDA in the 6 months prior to data collection, had no diagnosed history of any
of the morbid conditions under study, fell within a similar sex distribution and age
range as the cases, lived in the urban area of Mexico City, and had an address
locatable from the CLIDDA records. Individuals with another obvious disease were
excluded. We identified a sample of 240 controls. Once the
controls were identified by trained coders from the CLIDDA records, appointments were
made, and an interviewer visited each household to administer an SD questionnaire.

Signed informed consent was obtained prior to each interview. The project was approved
by the institutional review board of the University of Washington and by the research,
ethics, and biosafety committees of the National Institute of Public Health, Mexico,
and participant institutions.

Processing

The SD dataset was processed into a format usable by statistical models using the
same protocol as described in the PHMRC VA study [22]. Specifically, duration (continuous) survey items are converted to a dichotomous
"long duration" item using a median absolute deviation estimator, where the item is
considered endorsed if its value exceeds the long-duration cutoff. Cutoffs
used in this study are presented in Additional file 3. Categorical items are expanded into separate dichotomous items, one for each level
or category of the item. For clarity, the term "feature" will be used
to refer to the dichotomized (endorsed versus not endorsed) items or information used
by the model/estimation process, while the term “cause” will be used to refer to “condition”
or “illness” or to healthy controls.
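To make this step concrete, here is a minimal Python sketch (the column names, example values, and the exact median-plus-MAD cutoff rule are illustrative assumptions; the study's actual cutoffs are listed in Additional file 3):

```python
import pandas as pd

def dichotomize_duration(series: pd.Series) -> pd.Series:
    """Convert a continuous duration item into a binary "long duration" feature.

    Cutoff = median + median absolute deviation (one plausible reading of the
    protocol); values above the cutoff are treated as endorsed (1).
    """
    median = series.median()
    mad = (series - median).abs().median()
    return (series > median + mad).astype(int)

# Hypothetical survey responses: one duration item and one categorical item.
df = pd.DataFrame({
    "cough_duration_days": [2, 30, 5, 365, 10],
    "pain_location": ["chest", "joint", "chest", "none", "joint"],
})

features = pd.DataFrame(
    {"cough_long_duration": dichotomize_duration(df["cough_duration_days"])}
)
# Expand the categorical item into one dichotomous feature per category level.
features = features.join(pd.get_dummies(df["pain_location"], prefix="pain").astype(int))
```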

Natural language processing

The SD dataset is composed almost entirely of structured questionnaire items, but
free response and text transcription items are also included. One question on the
survey asked the interviewer to transcribe text found on any drug containers in the
household, and a second free response item asked the interviewer to write down any
other pertinent information about the interview that he/she felt was useful.

We implemented techniques based on text mining and natural language processing to
capture the "free response" information [25-29]. We were interested in identifying text signals that held some diagnostic value and
then in “tokenizing” the free text into data features that could be used by computational
algorithms. For example, for the text feature “alcohol”, an interview would have a
1 if that word appeared in the free response section and a 0 if it did not. In addition,
some words or expressions are essentially synonymous for data classification purposes
(for example, “alcohol”, “alcoholism”, and “alcoholic”). In a process called stemming,
we treat the root of the word (in this case, “alcohol”) as the actual text feature
instead of the entire word itself. To handle misspellings, mistranslations,
or variations in medical terminology, we used the dictionary developed for VA analysis,
which maps roughly synonymous words to a single text feature. We used the tm
package in R [30] in this process.
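A minimal Python sketch of this root-based tokenization (the study used the tm package in R; the root list and example text below are illustrative assumptions, not the study's dictionary):

```python
import re

# Hypothetical word roots treated as text features; any token beginning with
# a root endorses that feature, so "alcohol", "alcoholism", and "alcoholic"
# all map to the single "alcohol" feature (stemming).
ROOTS = ["alcohol", "cirrosis", "insulina"]

def text_features(free_text: str) -> dict[str, int]:
    """Tokenize free text and endorse one dichotomous feature per root."""
    tokens = re.findall(r"[a-záéíóúñü]+", free_text.lower())
    return {
        f"text_{root}": int(any(tok.startswith(root) for tok in tokens))
        for root in ROOTS
    }

print(text_features("paciente con alcoholismo cronico"))
# {'text_alcohol': 1, 'text_cirrosis': 0, 'text_insulina': 0}
```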

Train-test environment

A critical component of developing and validating data classification models is constructing
an appropriate validation environment. A given model must be “trained” on a randomly
selected portion of the dataset and then “tested” on an uncontaminated separate portion
of the dataset. In this study we split the entire dataset into 75% train data and
25% test data, with sampling stratified by the outcome variable (in this case,
the disease). Thus, if there were 100 cirrhosis cases in the full dataset, then 25
would be sampled into the test data and 75 into the train data. This train-test split
is repeated 500 times to conduct 500 simulations and to estimate uncertainty around
the predictive validity estimates.
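A sketch of one such stratified 75/25 split in Python (toy data standing in for the SD dataset; the study repeated this over 500 random splits):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the SD data: 1,100 subjects, 11 outcomes (10 NCDs plus
# healthy controls), 50 dichotomous features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1100, 50))
y = np.repeat(np.arange(11), 100)

for seed in range(500):  # one split per simulation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # ...train on (X_train, y_train), evaluate on (X_test, y_test)...
```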

Previous research in VA has shown that i) predictive validity is artificially enhanced
when the test and train cause compositions are similar, and ii) the estimated performance
of a method is largely a function of the cause composition of the test dataset [31]. To address the second problem, following Murray et al. [31], we varied the composition of the test data by resampling with replacement based
on an uninformative Dirichlet distribution.
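One way to implement this resampling in Python (the function name and details are ours, not the study's code; inputs are assumed to be NumPy arrays):

```python
import numpy as np

def vary_test_composition(X_test, y_test, rng):
    """Redraw the cause composition of a test split from an uninformative
    (flat) Dirichlet, then resample subjects with replacement to match it."""
    causes = np.unique(y_test)
    fractions = rng.dirichlet(np.ones(len(causes)))  # uninformative prior
    n = len(y_test)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y_test == cause),
                   size=int(round(frac * n)), replace=True)
        for cause, frac in zip(causes, fractions)
    ])
    return X_test[idx], y_test[idx]
```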

Models

Four data-driven models for VA classification were tested and validated as part of
the PHMRC study. We were able to adapt each of these models for use in the SD analysis.
Three of the models – Tariff [18], Simplified Symptom Pattern [32], and Random Forest [16] – are capable of diagnosing individual subjects and estimating cause-specific mortality
fractions (CSMFs), or in our case, cause-specific prevalence fractions (CSPFs), while
the King-Lu [33] algorithm can only estimate CSMFs (or in our case, CSPFs). The method selected for
this analysis was Tariff, which has shown good performance in previous studies [23] and is described in detail below.

Tariff is a simple additive algorithm that calculates a "tariff" (similar
to a Z-score) for each cause-feature combination in the training data, followed by
a summation and ranking step to predict the most likely cause for each subject
in the test dataset. The Tariff for a given cause-feature combination quantifies how
uniquely and strongly predictive a given data feature is for a given cause. The Tariff
for cause $i$ and feature $j$ is calculated as:

$$\mathrm{Tariff}_{ij} = \frac{x_{ij} - \mathrm{median}(x_j)}{\mathrm{IQR}(x_j)}$$

where $\mathrm{Tariff}_{ij}$ is the tariff for cause $i$ and feature $j$, $x_{ij}$ is the fraction of subjects with cause $i$ with a positive response for
item $j$, $\mathrm{median}(x_j)$ is the median fraction with a positive response for feature $j$ across all causes,
and $\mathrm{IQR}(x_j)$ is the interquartile range of positive response rates for feature $j$ across
causes.
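A compact Python sketch of this calculation (a minimal implementation under our reading of the formula; function and variable names are ours):

```python
import numpy as np

def tariff_matrix(X_train, y_train):
    """Compute the Tariff for every cause-feature pair.

    x[i, j] is the endorsement rate of feature j among training subjects with
    cause i; its tariff is the distance from the across-cause median, scaled
    by the across-cause interquartile range.
    """
    causes = np.unique(y_train)
    x = np.vstack([X_train[y_train == c].mean(axis=0) for c in causes])
    q75, q25 = np.percentile(x, [75, 25], axis=0)
    iqr = np.where(q75 - q25 > 0, q75 - q25, 1.0)  # guard against zero IQR
    return (x - np.median(x, axis=0)) / iqr, causes
```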

For each subject in the SD dataset, we compute a summed Tariff score for each cause:

$$\mathrm{Score}_i = \sum_j \mathrm{Tariff}_{ij} \, x_j$$

where $x_j$ is the subject's dichotomous response (1 if endorsed, 0 otherwise) for feature $j$. The Tariff scores for each cause are ranked across all subjects, and the top-ranked
cause for each subject is assigned as the diagnosis for that subject.
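A sketch of this scoring step, simplified for illustration (the full Tariff method ranks each subject's scores against the training data before assignment; here we assign the top-scoring cause directly):

```python
import numpy as np

def predict_causes(X_test, tariffs, causes):
    """Sum the tariffs of endorsed features to score each cause per subject,
    then assign each subject the cause with the highest score (a
    simplification of the rank-based assignment described above)."""
    scores = X_test @ tariffs.T  # (subjects x causes) score matrix
    return causes[scores.argmax(axis=1)]

# Usage with the tariff_matrix() sketch above:
# tariffs, causes = tariff_matrix(X_train, y_train)
# y_pred = predict_causes(X_test, tariffs, causes)
```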

Analysis

We assessed and compared the capability of the SD model using the VA performance metrics
described by Murray et al. [31]. SD is capable of i) predicting whether or not an individual suffers from different
NCDs and ii) estimating the fraction of individuals in a population who suffer from
a given condition. Consequently, the performance of SD should be quantified in both
the individual and population domains.

Chance-corrected concordance

Chance-corrected concordance (CCC) is a measure of a method's ability to correctly
diagnose a condition in an individual. Because random assignment among N different
causes would be correct 1/N of the time, this metric must be adjusted for random chance
[31].

The formal calculation of CCC for cause $j$ ($\mathrm{CCC}_j$) is:

$$\mathrm{CCC}_j = \frac{\dfrac{TP_j}{TP_j + FN_j} - \dfrac{1}{N}}{1 - \dfrac{1}{N}}$$

where $TP_j$ is the number of true positives for cause $j$, $FN_j$ is the number of
false negatives, and $N$ is the number of causes or conditions (11 in this study).
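In code, the CCC for a single cause is a straightforward transcription of this formula (names are ours):

```python
import numpy as np

def chance_corrected_concordance(y_true, y_pred, cause, n_causes=11):
    """Sensitivity for one cause, corrected for the 1/N accuracy expected
    from randomly assigning one of N causes."""
    mask = np.asarray(y_true) == cause
    sensitivity = np.mean(np.asarray(y_pred)[mask] == cause)  # TP / (TP + FN)
    return (sensitivity - 1 / n_causes) / (1 - 1 / n_causes)
```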

Cause-specific prevalence fraction (CSPF) accuracy

Following Murray et al. [31], we used CSPF accuracy as a metric to assess the ability of the questionnaire to
estimate prevalence fractions, analogous to CSMF accuracy in a verbal autopsy study. In our case,
CSPF accuracy, which is an aggregate measure across all causes $k$, is formally defined
as:

$$\mathrm{CSPF\ accuracy} = 1 - \frac{\sum_{k=1}^{K} \left| \mathrm{CSPF}_k^{t} - \mathrm{CSPF}_k^{pred} \right|}{2 \left( 1 - \min_k \mathrm{CSPF}_k^{t} \right)}$$

where the superscript on CSPF refers to true ("t") or predicted ("pred") cause fractions.
The denominator reflects the maximum possible CSPF error in the given test split:

$$\mathrm{maximum\ CSPF\ error} = 2 \left( 1 - \min_k \mathrm{CSPF}_k^{t} \right)$$
Hence, CSPF accuracy can be described as 1 minus the sum of absolute errors divided
by the maximum error. A CSPF accuracy of 1 would indicate perfect cause fraction predictions,
while 0 would indicate the worst possible model.
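A direct transcription of this metric into Python (the fractions in the usage line are illustrative values only):

```python
import numpy as np

def cspf_accuracy(true_cspf, pred_cspf):
    """1 minus the total absolute error in cause fractions divided by the
    maximum possible error, 2 * (1 - smallest true fraction)."""
    true_cspf = np.asarray(true_cspf, dtype=float)
    pred_cspf = np.asarray(pred_cspf, dtype=float)
    abs_error = np.abs(true_cspf - pred_cspf).sum()
    max_error = 2 * (1 - true_cspf.min())
    return 1 - abs_error / max_error

print(cspf_accuracy([0.5, 0.3, 0.2], [0.4, 0.35, 0.25]))  # 0.875
```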

We assessed the performance of SD using the two metrics described above and in terms
of cause fraction absolute error, which allows inspection of its performance in
measuring prevalence fractions for each cause. Each type of validation was conducted
across 500 splits of data. We tested each method under two conditions: with all data
features and with all data features excluding HCE information.

We also analyzed whether SD methods systematically over- or underestimate the prevalence
fractions. Using the true and estimated prevalence fractions from the 500 splits, we
conducted linear regressions in which the estimated prevalence fraction was modeled
as a function of the true prevalence fraction. Stata, R, and Python were used for all
analyses and data management.
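A sketch of this check in Python (the data are simulated purely for illustration and are not study results):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in: true CSPFs for 11 causes across 500 splits, paired with
# noisy, deliberately compressed estimates.
true_frac = rng.dirichlet(np.ones(11), size=500).ravel()
est_frac = 0.02 + 0.8 * true_frac + rng.normal(0, 0.01, true_frac.size)

# Regress estimated on true prevalence: a slope below 1 with a positive
# intercept indicates overestimation of small fractions and underestimation
# of large ones.
slope, intercept = np.polyfit(true_frac, est_frac, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.3f}")
```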