Poor quality in the reporting and use of statistical methods in public health – the case of unemployment and health

The quality of the reporting and use of statistical analyses in articles that studied
the effect of unemployment on health was poor. This problem is not unique to the topic
of this review; it has also been identified in previous analyses of methodological
shortcomings in scientific articles. This review focused on twelve different criteria.
The majority of the articles failed to fulfill most of the criteria, the most frequently
unmet being conformity with a linear gradient (100 % of the articles), validation of the
statistical model (100 %), collinearity of independent variables (97 %), fitting procedure
(93 %), goodness of fit test (78 %), selection of variables (68 % for the candidate
model; 88 % for the final model), and interactions between independent variables (66 %).

All but one article fulfilled the criterion for events per variable. However, in some
of the articles in which the criterion was fulfilled, the authors were probably not
aware of this; the criterion was met simply because of a large sample size and a
restricted number of candidate independent variables. Previous reviews reported a
potential problem with events per variable in the statistical model for 39–62 % of the
included studies [2, 5, 11, 16, 17, 19]; exceptions are the reviews by Mantzavinis et
al., which reported problems for only 10 % of the articles, and by Kalil et al., which
reported problems for 16 % of the articles [7, 14]. None of the articles with continuous
variables provided evidence that their choice of a linear gradient for the continuous
variables in the statistical model was valid. This criterion has previously been shown
to be poorly met, with at most 29 % of articles reporting such evidence [14]. Other
evaluations report that only about 15 % of articles fulfill this criterion
[2, 7, 9, 11, 16, 17, 19].
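
To illustrate the events-per-variable rule of thumb (commonly, at least ten events of the rarer outcome category per candidate predictor), a minimal check might look like the following sketch; the subject and predictor counts are hypothetical, not taken from any reviewed study:

```python
def events_per_variable(n_events: int, n_nonevents: int, n_predictors: int) -> float:
    """Events per variable (EPV) for a logistic regression model.

    By convention, the 'events' are the less frequent outcome category;
    a common rule of thumb requires EPV >= 10.
    """
    events = min(n_events, n_nonevents)
    return events / n_predictors

# Hypothetical study: 5,000 subjects, 400 of whom became unemployed,
# modeled with 12 candidate independent variables.
epv = events_per_variable(400, 4600, 12)
print(f"EPV = {epv:.1f} -> {'OK' if epv >= 10 else 'too few events'}")
```

As the discussion above notes, a large sample easily satisfies this criterion without the authors ever computing it; the point of reporting it is to show the check was made.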

Few articles (34 %) included or discussed interaction terms, but those that did also
tested the significance of the interactions. Similar results were presented in most
previous reviews [2, 5, 7, 9, 11, 14, 16, 17, 19, 21], with one review showing that
45 % of the articles fulfilled the criterion [21] and one showing that fewer than 10 %
did so [11]. A majority of the articles reviewed here (66 %) presented stratified
results, which to some extent can be considered to fill the role of checking for
interactions in the statistical model, but this is not a sufficient reason for failing
to discuss potential interactions in an article. Only two articles discussed collinear
variables. Previous similar reviews also reported that few articles discuss the
potential problems arising from collinear variables; only the review by Ottenbacher et
al. (17 % of articles) reported that this criterion was fulfilled in over 10 % of the
reviewed articles [2, 7, 11, 16, 17, 19, 21].
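
Collinearity is also inexpensive to screen for before modeling. A simple first step is to inspect pairwise correlations between candidate predictors; a minimal pure-Python sketch with made-up data follows (variance inflation factors are a more complete diagnostic, since they also detect collinearity involving three or more variables):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictors: x2 is almost a rescaled copy of x1,
# so the pair would be flagged; x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 16.0]
x3 = [5, 1, 4, 2, 6, 3, 7, 2]
print(f"r(x1, x2) = {pearson(x1, x2):.3f}")  # near 1: potential collinearity
print(f"r(x1, x3) = {pearson(x1, x3):.3f}")
```

A high pairwise correlation between two independent variables is exactly the situation in which coefficient estimates and their standard errors become unstable, which is why the criterion asks authors to at least discuss it.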

None of the articles reported that they had validated their statistical models, which
is consistent with previous similar reviews showing that few articles reported model
validation (at most 10 %, in Tetrault et al.) [2, 7, 11, 16, 19, 21]. Only two articles
provided explicit statistical significance for the model. In 14 articles, neither a
confidence interval nor an explicit p-value was given; instead, the p-value was
presented as simply below or above the significance level. Previous articles that have
evaluated how frequently articles present confidence intervals in logistic regression
have reported proportions similar to the 14 % reported here, including 26 % reported
by Moss et al. and 29 % reported by Ottenbacher et al. [17, 19]. Although many authors
recommend always reporting confidence intervals [75], they were not reported for most
other methods.
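
Reporting a confidence interval instead of a bare "p < 0.05" costs little once the model is fitted. For a logistic-regression coefficient, a 95 % Wald interval for the odds ratio follows directly from the coefficient and its standard error; the numbers below are hypothetical:

```python
from math import exp

def odds_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Odds ratio exp(beta) with a 95 % Wald confidence interval."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

# Hypothetical model output: coefficient 0.41 (SE 0.12) for unemployment.
or_, lo, hi = odds_ratio_ci(0.41, 0.12)
print(f"OR = {or_:.2f} (95 % CI {lo:.2f} to {hi:.2f})")
```

The interval conveys both the direction and the precision of the estimate, which a dichotomized p-value cannot.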

About 48 % of the articles did not report all coefficients in the statistical models.
In previous evaluations this criterion was fulfilled to a greater extent, even though
those evaluations also required that confidence intervals were reported for the
independent variables [7, 17]. Few articles (22 %) in this review presented goodness
of fit tests, but this was more than in previous reviews [2, 7, 9, 11, 16, 19]; in
three of those reviews, fewer than 5 % of the included articles reported such tests
[2, 7, 16]. However, the higher rate of goodness of fit testing in this review can
mainly be explained by the inclusion of articles that do not use logistic regression.
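
For logistic regression, the Hosmer-Lemeshow test is the goodness of fit test most often meant by this criterion. A minimal sketch of the statistic, using hypothetical predicted probabilities and outcomes (in practice the statistic is referred to a chi-square distribution with `n_groups - 2` degrees of freedom):

```python
def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic for a fitted logistic model.

    Subjects are sorted by predicted probability and split into roughly
    equal-sized groups; within each group, observed and expected event
    counts are compared.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    chi2 = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        m = len(group)
        expected = sum(p for p, _ in group)
        observed = sum(y for _, y in group)
        mean_p = expected / m
        denom = m * mean_p * (1 - mean_p)
        if denom > 0:
            chi2 += (observed - expected) ** 2 / denom
    return chi2

# Hypothetical predicted probabilities and 0/1 outcomes for 20 subjects.
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95,
         0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.05, 0.92]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
            0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
print(f"HL chi-square = {hosmer_lemeshow(probs, outcomes, n_groups=5):.2f}")
```

A small statistic (large p-value) indicates that predicted and observed event rates agree across the risk spectrum; reporting it, as the criterion asks, takes a single sentence.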

Most of the articles may have had reasons for the inclusion of variables, but this can
only be judged if it is reported, and only 32 % of the articles in this review did so.
The selection of variables has often been evaluated in previous reviews, and the
proportion of articles fulfilling the criterion has varied from as high as 95 % [2] to
as low as 15 % [21]. In general, previous reviews have reported a higher frequency of
articles fulfilling the criterion than found in this review
[2, 5, 7, 9, 11, 14, 16, 17, 19, 21]. The importance of arguing for the final
selection of variables has also been discussed previously, but no previous reviews
have evaluated this. Only a few (12 %) of the articles in this review commented on how
the final selection of variables was performed. This is an area in which improved
reporting is needed.

It should be straightforward to report how variables are coded, but only 34 % of the
articles met this criterion. In four previous reviews, coding was considered
insufficient in 85 % or more of the included articles [2, 5, 7, 19], while across
previous reviews [2, 5, 7, 9, 11, 14, 16, 17, 19, 21] at most 84 % of the articles
fulfilled the criterion [21]. The fitting of the model was mentioned in only 7 % of
the articles reviewed here; in previous similar reviews, this requirement was met in
27 to 65 % of the articles [2, 7, 17, 19].

Methodological concerns are highlighted in the STROBE statement [24]. To my knowledge,
this is the first evaluation of how well articles fulfill this criterion. I decided to
include it because I noticed, while reading the articles in this review, that
surprisingly little space was devoted to this important issue. The choice of method
for analyzing the data is crucial for presenting reliable results. Only six (15 %) of
the articles in this review discussed the choice of statistical model, and 24 % of the
articles did not explicitly discuss limitations related to their analyses. No
statistical model is “perfect”, and therefore it is important for authors to inform
the reader about weaknesses even if they are difficult, or even impossible, to avoid.

Most of the criteria in this review have been used previously, for example, by Bagley
and colleagues [2] and Concato and colleagues [5]. Even if articles fail to fulfill
some of the criteria in this review, their analyses might still be of high quality.
However, failure to report what has been done suggests that the authors lack knowledge
about the statistical methods used to analyze their data. This is critical because
incorrect statistical analyses could potentially lead to incorrect results and,
consequently, to wrong conclusions. However, this review did not attempt to evaluate
whether the results from the statistical analyses in the reviewed articles were valid.

Only one article stated that individual study data were available online [31]. In
general, the lack of information about how the statistical analyses were performed
made it impossible to review the validity of the study results. However, in a few
articles it was obvious that the analysis of the data did not correspond to the
research aims, thereby calling the validity of the results into question. Evidence of
mistakes in the performance of statistical methods has been presented by, for example,
Garcia-Berthou and Alcaraz, who reported incongruences between test statistics and
p-values [77]. Lucena et al. reported that 41 % of their 209 reviewed articles used
inappropriate statistical methods [12].

There are other criteria related to the quality of the performed statistical analyses
that would have been valuable to assess in this review. Among these, it would have
been useful to assess the use of cut-offs for the variables in the models. The
grouping of variables seemed to be well chosen in the articles, but it is nevertheless
a major weakness that cut-offs were rarely explained or justified. The efficiency of
an analysis improves when as much information as possible is retained in the
variables, making it important to provide the reasons for the chosen cut-offs. Some of
the criteria in this review are related to confounding effects.
However, whether a variable is a confounder or an effect modifier cannot be interpreted
only from the coefficients of, for example, a logistic regression. The general impression
from the articles in this review is that some authors adopted a very superficial approach
to the assessment of confounders. I am not aware of any studies that have evaluated
articles based on this aspect, and I suggest that this would be a good topic for a
future study.

The STROBE statement was developed as a guideline for the reporting of observational
studies, and it has been recommended as a checklist for scientific journals [24]. The
Lancet is one of the journals that require the STROBE checklist to be submitted, but
another highly reputed medical journal, the New England Journal of Medicine (NEJM),
does not require it. Instead, NEJM demands that the criteria for statistical analysis
listed by Bailar and Mosteller in 1988 are fulfilled [78]. However, neither the STROBE
statement nor the criteria by Bailar and Mosteller require that the criteria brought
up by, for example, Bagley and colleagues and Concato and colleagues are fulfilled or
discussed in articles [2, 5]. Hence, current publication guidelines are not sufficient
to ensure that the criteria used in this review are fulfilled, and they are in need of
further improvement.

The aim of this article was not to propose a new guideline or an extension of current
guidelines, although the criteria that I have evaluated could be considered for such a
guideline. The challenge of improving current guidelines has to be taken up by a
well-reputed group of experts, similar to the STROBE group (or even the STROBE group
itself), as consensus is very important for such guidelines to be well received. I
consider it crucial that some of the criteria I have listed are well thought through,
among them the criteria related to interactions, collinearity, and conformity with a
linear gradient; poor handling of these may result in highly biased estimates. Other
issues, such as validation of the statistical model, place a higher demand on the
statistical competence of the authors, but inadequacy in this area can also bias the
estimates, and this is therefore an important issue as well. It is particularly
surprising that the criteria for variable coding and selection are not addressed, as
little statistical competence is required to do so. Such documentation should be
integral to the implementation phase of the study.

If authors are overloaded with instructions, this may mean that some valuable studies
are not published. Current guidelines, such as the STROBE statement [24], require that
a checklist is filled in and submitted as a supplementary file. My suggestion for an
improvement is that such a checklist should also be submitted regarding the
statistical analyses. If a study has been performed to an acceptable level, then
documenting what has been done should not be difficult. This is not time demanding,
and the value for other researchers is substantial, as they will be able to check the
analyses and conclusions. Although some articles may not be published if the
guidelines are tightened as suggested here, this would improve the overall quality of
published papers. One reasonable argument against introducing further criteria such as
those suggested here is that it may be difficult to stay within word limits; on the
other hand, some information could be moved from the main manuscript to a
supplementary file to offset this. It might also be that some of the criteria would
not demand additional analysis, and it would be sufficient to simply report in the
checklist that the analysis was not done, e.g. that no internal validation was
performed.

An additional value from a checklist is that if important issues are highlighted in
the journals’ instructions to the authors, it is likely that the authors will be better
prepared to deal with these issues and such instructions might even help the authors
in their statistical analyses. It is also important in the publication process that
the reviewers are capable of evaluating how well the authors have fulfilled the criteria.
Thus, I recommend not only improving the guidelines for authors, but also asking reviewers
to assess the extent to which articles fulfill the criteria for statistical analysis.
It is important, of course, that such requirements are handled so as not to unduly
burden the reviewers, because the review process is already highly demanding of their
time.