Why item response theory should be used for longitudinal questionnaire data analysis in medical research

Structural model for longitudinal latent variables

A structural model (also known as latent regression model) describes the relationships
between predictors and latent variables while addressing additional dependencies between
the repeatedly measured latent variables. A well-known method to account for the nested
structure of longitudinal data is multilevel modeling (or mixed modelling, random
effects modelling, hierarchical linear modelling) as the structural model. Advantages
of using multilevel modelling for longitudinal data analysis are that it is not necessary
that subjects are measured on the same time points nor do follow up times need to
be uniform. Furthermore, the model is capable of handling time-invariant and time-variant
covariates. Also, it is possible to estimate subject-specific change across time.
The following multilevel model will be considered,


\theta_{ij} = \beta_j + e_{ij}, \qquad \beta_j = \mu + u_j
(1)

where θ_ij is the latent variable location of person j at measurement occasion i, and β_j
is the random intercept representing the average latent variable location of person j
over measurement occasions. Both error terms are normally distributed, with e_ij ~ N(0, σ²)
and u_j ~ N(0, τ²). The variance parameter τ² of the multilevel model is the variance
between persons (i.e. the level-2 variance), whereas σ² is the repeated measurement
variance (i.e. the variance of the measurements within a person; the level-1 variance).
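
As an illustration of Eq. 1, the following base-R sketch simulates latent scores under the model; it is not the authors' code, and the values of μ, τ², and σ² are illustrative choices (they match the settings used in the simulation study below).

set.seed(1)
n_person   <- 100                  # number of subjects j
n_occasion <- 6                    # number of measurement occasions i
mu <- 0; tau2 <- 0.8; sigma2 <- 0.4

beta_j   <- mu + rnorm(n_person, sd = sqrt(tau2))             # random intercepts, beta_j = mu + u_j
theta_ij <- rep(beta_j, each = n_occasion) +
            rnorm(n_person * n_occasion, sd = sqrt(sigma2))   # occasion scores, theta_ij = beta_j + e_ij
dat <- data.frame(person   = rep(seq_len(n_person), each = n_occasion),
                  occasion = rep(seq_len(n_occasion), times = n_person),
                  theta    = theta_ij)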

Measurement part of the model

To estimate the latent variable θ_ij that is used in the structural model described in
Eq. 1, CTT- or IRT-based methods can be used. Lord and Novick [24] (p. 44) describe the
basic equation for the composition of the observed score, X_gij, for the latent variable,
T_gij, of person j on measurement occasion i on questionnaire g as


X_{gij} = T_{gij} + E_{gij},
(2)

where T_gij is the true score and E_gij the error of measurement. The observed score thus
consists of the true score plus an error of measurement, which is assumed to be unbiased.
Under this assumption about the measurement error, numbers can be attached to the answer
categories of the items and summed over all items of the questionnaire. A test score
(i.e., sum-score) can then be defined as


\theta_{ij,\mathrm{CTT}} = \sum_{k=1}^{K} X_{kij},
(3)

where the response pattern of person j on measurement occasion i is given by
(X_1ij, …, X_Kij), and K is the number of items in the questionnaire. These sum-scores are
the CTT estimates of the latent variable and are used as the outcome variable in the
longitudinal analysis (i.e. the structural model). There are two main problems with this
way of quantifying latent variables. The first is that the characteristics of the test and
of the subject are inseparable: neither can be interpreted without the other, which makes
sum-scores population dependent. The second is that the standard error of measurement is
assumed to be the same for all subjects, although some sum-scores are more informative
about the latent variable than others. That is, different subjects are likely measured with
different precision. For example, extremely high or low sum-scores are less reliable than
average sum-scores, meaning that extreme sum-scores are less able to distinguish between
subjects than sum-scores in the middle of the scale.

The item response patterns are more informative about the latent variable than the
aggregated sum-scores, which ignore differences between response patterns leading
to the same sum-score. For example, when answering 'yes' to 10 out of 20 dichotomous
items (1 = 'yes', 0 = 'no'), the sum-score of 10 can be obtained in 20!/(10!·10!) = 184,756
ways. Under IRT, different scores are assigned to the different response patterns that lead
to the same sum-score, making it possible to distinguish between respondents with identical
sum-scores.
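
This count can be checked directly; a one-line base-R illustration (the item numbers simply mirror the example above):

# number of distinct yes/no patterns on 20 dichotomous items sharing the sum-score 10
choose(20, 10)   # 184756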

Using IRT modeling is an accepted way to account for differences in measurement precision
between persons [25–29] and can be used to estimate scores for the latent variable. In IRT,
the relation between the unobserved latent variable θ and the observed item scores is
described by item characteristic curves that model the probability of the observed item
responses. As a result, the item and latent variable estimates in IRT modeling are not
dependent upon the population [30]. Figure 1 depicts an example of item characteristic
curves for an item with four response categories, where the probabilities of choosing a
certain category are plotted against the latent variable. An IRT model describes the
relationship between latent variables and the answers of persons to the items of the
questionnaire measuring the latent variable [31]. For ordered response data, the
probability that an individual indexed ij, with underlying latent variable θ_ij, responds
in category c (c = 1, …, C) on item k is represented by


P(y_{ijk} = c \mid \theta_{ij}, a_k, \boldsymbol{\kappa}_k) = \Phi(a_k \theta_{ij} - \kappa_{k,c-1}) - \Phi(a_k \theta_{ij} - \kappa_{k,c}),
(4)

Fig. 1. Item response characteristic curves for item 2 from the STAI-DY with four answer
categories. The item used for this example was 'I feel nervous and restless', with the
answer categories '1. Almost never', '2. Sometimes', '3. Often', and '4. Almost always'.
The crossing of two lines marks a threshold and can be interpreted as the location on the
latent variable scale where the probability of choosing the corresponding category or a
higher one is 0.5

where κ_kc are the C_k − 1 threshold parameters of item k. The probability that the
response y_ijk falls into category c is the difference between the cumulative normal
probabilities (Φ) at the thresholds of category c − 1 and category c. The thresholds of the
ordered response categories satisfy −∞ < κ_k1 ≤ κ_k2 ≤ κ_k3 < ∞.

This item response model is called the graded response model [32] (or ordinal probit model
[31, 33]). A Rasch [34] restriction was used, fixing the discrimination parameters, a_k, to
one.
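
The category probabilities of Eq. 4 are easy to evaluate directly. The following base-R sketch does so under the Rasch restriction a_k = 1; the function name grm_probs and the threshold values are illustrative and are not taken from the mlirt implementation.

grm_probs <- function(theta, kappa, a = 1) {
  # Eq. 4 with kappa_0 = -Inf and kappa_C = Inf, so that
  # P(y = c) = Phi(a*theta - kappa[c - 1]) - Phi(a*theta - kappa[c])
  k <- c(-Inf, kappa, Inf)
  sapply(2:length(k), function(cc) pnorm(a * theta - k[cc - 1]) - pnorm(a * theta - k[cc]))
}
grm_probs(theta = 0, kappa = c(-1, 0, 1))   # probabilities of categories 1..4; they sum to 1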

Computing and generating IRT- and CTT-based scores

Different methods exist for generating values of the latent variable in the IRT framework.
The first is to generate point estimates of the latent variable modeled by the IRT model by
constructing the posterior distribution of the latent variable given the data. An important
assumption of IRT modeling is conditional independence, which entails that the response
probabilities of the items depend solely on the latent variable, θ_ij, and the item
parameters. As a result, the joint probability of a response pattern y_ij of person j on
measurement occasion i, given the latent variable θ_ij and the item parameters, over the K
items of the questionnaire, is the product of the probabilities of the individual item
responses given this person's position on the latent variable θ_ij (Eq. 5):


p(\mathbf{y}_{ij} \mid \theta_{ij}) = p(y_{ij1} \mid \theta_{ij})\, p(y_{ij2} \mid \theta_{ij}) \cdots p(y_{ijK} \mid \theta_{ij}) = \prod_{k=1}^{K} p(y_{ijk} \mid \theta_{ij}),
(5)

When assuming a prior distribution for the latent variable, g(θ_ij), a posterior mean can
be derived from the posterior p(θ_ij | y_ij) = p(y_ij | θ_ij) g(θ_ij) / p(y_ij), where the
posterior follows from Bayes' rule [35]. The posterior mean can be used as an estimate of
the latent score.
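
A minimal base-R sketch of this posterior-mean (EAP) calculation for a single response pattern, using a grid approximation with a standard normal prior. The function names, the grid, and the five items with their thresholds are illustrative assumptions, not the estimation routine used by the authors.

grm_probs <- function(theta, kappa, a = 1) {   # category probabilities of Eq. 4
  k <- c(-Inf, kappa, Inf)
  sapply(2:length(k), function(cc) pnorm(a * theta - k[cc - 1]) - pnorm(a * theta - k[cc]))
}
posterior_mean <- function(y, kappa_list, grid = seq(-4, 4, by = 0.01)) {
  like <- sapply(grid, function(th)                      # likelihood of Eq. 5
    prod(mapply(function(resp, kap) grm_probs(th, kap)[resp], y, kappa_list)))
  post <- like * dnorm(grid)                             # times the N(0, 1) prior
  post <- post / sum(post)                               # normalise over the grid
  sum(grid * post)                                       # posterior mean of theta
}
kappas <- replicate(5, c(-1, 0, 1), simplify = FALSE)    # five items, identical thresholds
posterior_mean(y = c(1, 2, 2, 3, 4), kappa_list = kappas)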

When using the posterior mean as an estimate of the latent score and thus as an outcome
variable in the structural model, the uncertainty associated with the score is ignored.
To account for this uncertainty, plausible value technology is used [18, 19, 36, 37], where
the latent outcome variable is treated as missing data. Plausible values are
generated from the posterior distribution of the latent variable to obtain a complete
data set. This data set can be analyzed in the secondary data analysis. When constructing
the posterior of the latent variable, all available information is used. The posterior
is proportional to the likelihood times the prior, which can be represented by


p(\theta_{ij} \mid \mathbf{y}_{ij}, \sigma^2, \tau^2, \mu) \propto p(\mathbf{y}_{ij} \mid \theta_{ij})\, p(\theta_{ij} \mid \sigma^2, \tau^2, \mu).
(6)

The structural model parameters are integrated out such that the (marginal) posterior
of the latent variable only depends on the response pattern. This marginal posterior
is thus dependent only on the data, in this case the data of subject j on occasion i. We
sample from this marginal distribution in order to obtain plausible scores for subjects
with similar response patterns and background characteristics as the subjects in the sample
[20, 21].
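
Given a discretized posterior of this kind, plausible values can be drawn by sampling grid points in proportion to their posterior probability. A minimal sketch (the function name and inputs are illustrative; the toy call simply uses a standard normal "posterior"):

draw_plausible_values <- function(grid, post, M = 5) {
  # grid: grid points for theta; post: normalised posterior probabilities on the grid
  sample(grid, size = M, replace = TRUE, prob = post)
}
grid <- seq(-4, 4, by = 0.01)
draw_plausible_values(grid, post = dnorm(grid) / sum(dnorm(grid)), M = 5)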

For the CTT model, the sum-score defined in Eq. 3 is considered to be an unbiased estimate
of the true score. This true score is taken as an outcome of the multilevel model in Eq. 1.
When assuming the CTT model for the measurement of the construct score, according to Eq. 2
the distribution of the observed scores given the true score is p(y_ij | θ_ij,CTT). The
posterior distribution of the true score is then given by


p(\theta_{ij,\mathrm{CTT}} \mid \mathbf{y}_{ij}, \sigma^2, \tau^2, \mu) \propto p(\mathbf{y}_{ij} \mid \theta_{ij,\mathrm{CTT}})\, p(\theta_{ij,\mathrm{CTT}} \mid \sigma^2, \tau^2, \mu).
(7)

Parallel measurements are needed to estimate the true score (error) variance, but
they are usually not available. When the measurement error variance cannot be estimated
under the CTT model, the first term on the right-hand side is not included in defining
the posterior distribution, and an unbiased estimate of the true score is assumed.
However, the measurement error can still be assumed to be included in the multilevel
model specification (i.e., the second term on the right-hand side). In that case,
the population variance is used as an approximation of the subject-specific measurement
error variance [24] (p. 155). This approach was also used in the present study.

Analogous to generating plausible values under the IRT model (Eq. 6), the
marginal distribution of the (true) scores is used to generate plausible values under
the CTT model. Note that the drawn plausible values are realizations of the true score
under the structural multilevel model given the sum score as an unbiased estimate
of the true score.

In the literature, it is recommended to draw five sets of plausible values to address the
uncertainty associated with the plausible values for the missing data [37, 38]. The
analysis then yields different results for each of the five complete data sets constructed
from the sets of plausible values. The final results are obtained by averaging the analysis
results, in this case the estimates of the structural model in Eq. 1.
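
A sketch of this fit-and-average step. The authors fitted the structural model with an extended version of the mlirt package and WinBUGS; here, purely for illustration, the random-intercept model of Eq. 1 is fitted with lme4, and pv_sets is a hypothetical list of five plausible-value data sets with columns person and theta (one plausible value per person-occasion).

library(lme4)

fit_one <- function(d) {
  fit <- lmer(theta ~ 1 + (1 | person), data = d)      # structural model of Eq. 1
  vc  <- as.data.frame(VarCorr(fit))
  c(mu     = unname(fixef(fit)["(Intercept)"]),
    tau2   = vc$vcov[vc$grp == "person"],               # between-person (level-2) variance
    sigma2 = vc$vcov[vc$grp == "Residual"])             # repeated measurement (level-1) variance
}
# pooled <- rowMeans(sapply(pv_sets, fit_one))          # average over the five complete data sets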

Comparing CTT and IRT-based estimates

When comparing the IRT- and CTT-based structural model estimates, scale differences have to
be taken into account. For the comparison, the CTT scores were rescaled to the scale of the
IRT-based plausible values, using a linear transformation as proposed by Kolen and Brennan
[39] (p. 337),


sc(y) = \frac{\sigma_{pv}}{\sigma_{Y}}\, y + \mu_{pv} - \frac{\sigma_{pv}}{\sigma_{Y}}\, \mu_{Y},
(8)

where μ_pv and σ_pv are the mean and standard deviation of the IRT-based plausible values,
and where μ_Y and σ_Y are the mean and standard deviation of the CTT-based plausible values.
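
A minimal sketch of this transformation (the function name rescale_ctt is illustrative):

rescale_ctt <- function(y, pv) {
  # Eq. 8: map the CTT-based values y onto the scale of the IRT-based plausible values pv
  ratio <- sd(pv) / sd(y)
  ratio * y + mean(pv) - ratio * mean(y)
}
# After rescaling, the CTT-based values have the same mean and SD as the IRT-based values.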

Next, the structural model was fit to the plausible values for each of the five draws.
Finally, the estimates resulting from the structural model were pooled to obtain the
final parameter estimates.

Simulation study

A simulation study is presented for evaluating the use of IRT-based plausible values
compared with CTT modeling for estimating latent variables used in longitudinal multilevel
analysis. The aim of this study was to investigate how well the true values of the
population parameters are recovered in different situations, with varying sample sizes,
numbers of items, and skewness of the latent variable distribution.

Design

The full cross-classified design, resulting in 546 conditions, is depicted in Table 1. Per
condition, 10 datasets were simulated using R statistical software [40] and analyzed using
an extended version of the R package mlirt [31] and WinBUGS [41, 42]. Data were generated
following the model described in Eq. 1, with the variance between persons (i.e. the level-2
variance) set to τ² = 0.8 and the repeated measurement variance to σ² = 0.4, while using
the IRT model from Eq. 4 to generate values for the normally distributed underlying latent
variable, θ_ij ~ N(0, 1). A unidimensional latent variable was assumed to cause the
responses, measured at six time points using a varying number of items with four answer
categories per item (C = 4). The numbers of items used are listed in Table 1. In the
simulated data, skewness of the latent variable was generated by changing the location of
the threshold parameters of the IRT model. For example, for a positive skewness of 2.6,
κ_k1 = 3, κ_k2 = 2, and κ_k3 = 1 were used for all items k. Such right-skewed data could
arise when relatively healthy persons are asked to answer a questionnaire measuring
clinical depression, leading to predominantly low sum-scores. The same can occur when
subjects have recovered after treatment and the same questionnaire is used at the baseline
and follow-up measurements. Item scores were generated based on the latent variable and the
parameters of the IRT model. The CTT-based scores were calculated according to Eq. 3 using
the simulated item scores; these sum-scores are the CTT estimates of the latent variable.
In Fig. 2, the distributions of the CTT scores are visualized using density plots for
different skewness conditions. Skewness is often encountered in epidemiological data, and
the influence of various levels of skewness of the latent variable distribution on the
recovery of the multilevel regression parameters was therefore investigated.
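
As an illustration of this data-generating scheme, the following compressed base-R sketch draws latent scores from Eq. 1, item responses from Eq. 4 with a_k = 1, and CTT sum-scores from Eq. 3; it is not the simulation code actually used, and the sample size, number of items, and (symmetric) threshold values are illustrative choices.

set.seed(2)
N <- 500; n_occ <- 6; K <- 10                 # persons, occasions, items (illustrative)
tau2 <- 0.8; sigma2 <- 0.4                    # between-person and repeated measurement variance
kappa <- c(-1, 0, 1)                          # thresholds; shifting them induces skewness

beta  <- rnorm(N, 0, sqrt(tau2))                                    # random intercepts (Eq. 1)
theta <- rep(beta, each = n_occ) + rnorm(N * n_occ, 0, sqrt(sigma2))

# category probabilities P(y = c | theta), c = 1..4, per person-occasion (Eq. 4, a_k = 1)
cumprob <- sapply(c(-Inf, kappa, Inf), function(k) pnorm(theta - k))
probs   <- cumprob[, 1:4] - cumprob[, 2:5]

items      <- replicate(K, apply(probs, 1, function(p) sample(1:4, 1, prob = p)))
sum_scores <- rowSums(items)                  # CTT sum-scores (Eq. 3)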

Table 1. Simulation conditions. The conditions of the full cross-classified design for the
simulation study. The number of items, the number of participants, and the skewness of the
latent variable distribution (relative to a normal distribution) were varied

Fig. 2. The density of four different skewed normal distributions for the latent variable

Figure 3 shows a schematic display of the simulation procedure for one replication. IRT-
and CTT-based plausible values were generated using the datasets from one of the simulation
conditions. The CTT-based plausible values were rescaled according to Eq. 8, the structural
model described in Eq. 1 was fit to five draws of plausible values, and the results were
pooled by averaging the parameter estimates. Mean squared errors (MSE), given by


\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2,
(9)

were calculated, where θ̂ denotes the parameter estimates resulting from the different
replications and θ the true values.

Fig. 3. Simulation procedure for one replication

The MSEs were calculated for the level-1 and level-2 variance estimates for both the CTT-
and the IRT-based plausible value analyses.
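
Eq. 9 can be computed directly from the replication estimates; a minimal sketch (the numbers are toy values, not simulation results):

mse <- function(estimates, true_value) {
  # Eq. 9: sampling variance of the estimates plus their squared bias
  var(estimates) + (mean(estimates) - true_value)^2
}
mse(estimates = c(0.43, 0.38, 0.41, 0.45, 0.39), true_value = 0.4)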

Simulation results

Figure 4 shows a selection of the variance estimates within persons (level-1) and between
persons (level-2). The estimates based on the IRT plausible value scores are closer to the
true parameter values of 0.4 for the within-person variance and 0.8 for the between-person
variance than the estimates from the CTT-based analysis. The difference between the methods
is smallest when the latent variable is perfectly normally distributed and gradually becomes
larger with increasing skewness of the latent variable distribution. The estimates from the
CTT model get closer to the true values when the number of items increases, moving from the
left to the right graphs. The estimates from the IRT model are close to the true values in
all number-of-items conditions, except for the N = 100 condition. The CTT repeated
measurement variance estimates for the conditions with ten or fewer items are even higher
than the between-person variance estimates in the case of more extreme skewness. This is in
contrast to the IRT-based estimates, where the variance estimates are very close to the
true values regardless of the skewness. Overall, the IRT method gives more accurate
estimates than the CTT model across all simulation conditions. When comparing the plots
from top to bottom, the sample size increases from N = 100 to N = 500 to N = 1,000.
Increasing the sample size does not influence the magnitude of the difference between the
two models; the IRT model gives better estimates in all sample size conditions. With
increasing sample size, the lines connecting the estimates become more stable, indicating a
more stable pattern of differences between the two methods. The MSEs for the variance
estimates are presented in Fig. 5, where it can be seen that the CTT model systematically
overestimates the repeated measurement variance and underestimates the between-person
variance. The difference between the true value and the value estimated by the CTT model
increases when the latent variable distribution becomes more skewed, and it becomes smaller
for the CTT-based estimates when the number of items and the sample size increase. The
observed difference between the IRT and CTT estimates thus depends on the manipulated
factors: more extreme data situations consistently cause larger differences between the
IRT- and CTT-based estimates. A complete representation of the results can be found in
Additional file 1 and Additional file 2 online.

Fig. 4. Pooled variance estimates. Selection of the pooled variance estimates for different
distributions, sample sizes (N = 100, 500, 1000), and numbers of items (K = 5, 10, 20). The
upper left plot, for example, represents 13 different skewness conditions (from negative to
positive) with 100 participants measured at six time points with a five-item questionnaire.
The points represent the IRT- and CTT-based estimates of the level-1 (repeated measurement)
and level-2 (between-person) variance. The horizontal lines represent the true values of
the variance parameters

Fig. 5. Mean squared errors. Plots of the MSEs for the level-1 (repeated measurement) and
level-2 (between-person) variance parameter estimates resulting from both the IRT- and the
CTT-based latent variable models

Empirical dataset

An example dataset was analyzed to illustrate the application of IRT-based plausible
values in epidemiological practice. Data were obtained from the Amsterdam Growth and
Health Longitudinal Study (AGHLS), which is a multidisciplinary longitudinal cohort
study that was originally set up to examine growth and health among teenagers [43]. Data
from the AGHLS were used in previous research to answer various research questions dealing
with the relationships between anthropometry [44], physical activity [45], cardiovascular
disease risk [46, 47], lifestyle [48, 49], musculoskeletal health, psychological health
[50], and wellbeing. The presented sample consists of 443 participants who were followed
over the period 1993–2006, with a maximum of three data points over time nested within
individuals. A subscale of the State-Trait Anxiety Inventory, Dutch Y-version (STAI-DY)
[51], was used to measure the latent variable 'state anxiety'; it consists of 20 items with
four answer categories. The histograms in Fig. 6 depict the sum-score distributions at the
three measurement occasions. The aim of
the analysis was to estimate the intercept and the variance parameters (i.e. an intercept
only model) in order to compare the CTT and IRT-based estimates. The measurement models
as well as the structural model that were used are comparable to those in the simulation
study above.

Fig. 6. Sum-score distributions. Histograms of the sum-score distribution of the latent
variable 'state anxiety' at the three measurement occasions in the AGHLS cohort. The
skewness of the sum-score distributions was 1.02, 0.99, and 1.18 at the first, second, and
third measurement occasion (from left to right), with 443, 338, and 126 participants
included, respectively