An external validation of models to predict the onset of chronic kidney disease using population-based electronic health records from Salford, UK


CKD prediction models included for external validation

Figure 1 depicts the model inclusion process. Of the 29 models identified by Collins et al. [14] and Echouffo-Tcheugui and Kengne [15], 18 were developed with the aim of predicting CKD onset. We excluded three models because of incomplete reporting of the regression models (regression coefficients not fully reported) in the original paper [49], and one model because it was developed in a specific sub-population (namely, HIV patients) [20]. We excluded a further seven models for which more than one predictor was missing in our dataset: missing data for eGFR, urinary excretion and C-reactive protein [50]; missing post-prandial glucose, proteinuria and uric acid [51]; missing eGFR and quantitative albuminuria [52]; and, finally, two models with missing eGFR and low levels of high-density lipoprotein cholesterol [52, 53], respectively. The final set consisted of seven models (five logistic regression models and two Cox proportional hazards (CPH) regression models) and five simplified scoring systems [36, 51–56]. Table 1 describes the details of the included models, and Additional file 3: Tables S1, S2 and S3 provide the population characteristics of the development datasets, the regression coefficients, and the simplified scoring systems.

Fig. 1. Procedure to identify and select CKD prediction models

Table 1. Details of studies developing CKD prediction models that were included for external
validation

All models were developed outside the UK, with the exception of QKidney® [36] (www.qkidney.org), which was developed on a large population from England and Wales selected from general practices using the EMIS EHR. Each included model used a different definition of CKD, but the majority used an older definition based on only one impaired eGFR measurement. The time horizons in the original papers differed from our 5-year definition, with the exception of QKidney® [36], which, however, allowed other time horizon options (1, 2, 3 and 4 years). For three models, the prediction time horizon was not specified [54–56]. However, we could derive from the study duration and data collection procedures in the
original publications that the time horizons were 1 [56], 2 [54] and 9 [55] years, respectively. For the remaining models, the reported time horizons were between 4 and 10 years [51, 52, 54].

Predictors included in the models were largely based on known CKD risk factors (hypertension, diabetes mellitus, or history of cardiovascular disease). The only biomarkers included were systolic and diastolic blood pressure, and body mass index. Multiple imputation of missing values was applied to these variables, along with deprivation, haemoglobin (used to calculate the presence of anaemia) and smoking. Missing values in these predictors ranged from 1.8 % to 70.0 %, with a median of 46.0 %. Conversely, we excluded proteinuria as a predictor from our analyses because of 94.6 % missing data (Table 2); as a result, the models by Bang et al. [54] and Kwon et al. [55] each had one missing predictor. Finally, three of the included models, which derived a simplified scoring system [53, 55, 57], did not report the intercept of their underpinning logistic regression model; we therefore estimated the intercepts from the prevalence of CKD and the predictors' summary statistics in the original studies.
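As an illustration of this intercept-recovery step, a minimal sketch in Python: assuming the reported prevalence approximates the event risk of a patient with mean predictor values, the missing intercept is the logit of the prevalence minus the linear predictor evaluated at the development cohort's predictor means. The function name and all numbers below are hypothetical, not taken from the original studies.

```python
import math

def estimate_intercept(prevalence, coefficients, predictor_means):
    """Approximate a missing logistic intercept as logit(prevalence)
    minus the linear predictor evaluated at the mean predictor values
    reported in the development study."""
    lp_at_means = sum(b * x for b, x in zip(coefficients, predictor_means))
    return math.log(prevalence / (1 - prevalence)) - lp_at_means

# Illustrative (made-up) numbers: 9 % prevalence, two predictors
# (e.g. mean age 62 years, mean diabetes prevalence 0.31).
b0 = estimate_intercept(0.09, [0.05, 0.7], [62.0, 0.31])
```

This is only a first-order approximation: the mean risk and the risk at the mean covariates coincide exactly only for a linear link.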

Table 2. Patients with complete and incomplete follow-up data stratified for CKD onset; values
are numbers (%) unless indicated otherwise

Study population characteristics

Figure 2 shows the cohort selection process. There were 187,533 adult patients with at least one record in the financial year 2009 in our database, of whom 178,399 remained after applying our exclusion criteria; 6941 of these patients (3.9 %) died before developing CKD. There were 162,653 patients (91.2 %) with complete follow-up data. Overall, there were 6038 incident cases of CKD during the study period. Tables 2 and 3 describe the characteristics of the cohorts with complete and incomplete follow-up.

Fig. 2. Cohort selection

Table 3. Prevalence of CKD risk factors (as expressed in NICE guidelines) stratified for CKD
onset; values are numbers (%) unless indicated otherwise

External validation

Table 4 presents the results of the external validation, namely discrimination and calibration. AUC values ranged from 0.892 (95 % CI, 0.888–0.985) to 0.910 (95 % CI, 0.907–0.913) for patients with complete follow-up data, and the c-index values for the two CPH models on the full cohort were 0.888 (95 % CI, 0.885–0.892) [51] and 0.900 (95 % CI, 0.897–0.903) [36], respectively. The simplified scores showed similar performance to the models from which they were derived. MAPE was below 0.1 for all models, with the sole exception of Thakkinstian et al. [56], for which the MAPE was 0.179 (standard deviation (SD), 0.161). Calibration plots (Fig. 3) and the related calibration slopes (Table 4) on the complete follow-up data showed similar figures to the MAPE analysis. The model by Thakkinstian et al. [56] confirmed a tendency to over-predict risk, with a calibration slope of 0.44 (95 % CI, 0.43–0.45). The only models that were well calibrated to our population were those by Bang et al. [54] and QKidney® [36], with calibration slope values of 0.97 (95 % CI, 0.96–0.98) and 1.02 (95 % CI, 1.01–1.04), respectively. All other models over-predicted risks (calibration slopes ranging between 0.53 [95 % CI, 0.52–0.53] and 0.68 [95 % CI, 0.67–0.69]), with the exception of the model by Kshirsagar et al. [53], which under-predicted risk and had a calibration slope of 1.74 (95 % CI, 1.72–1.76).
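The two calibration measures above can be sketched as follows. This assumes MAPE is computed as the mean absolute difference between predicted risk and the binary outcome, which may differ from the exact operationalisation used here, and it fits the standard recalibration model with a plain Newton-Raphson routine rather than a statistics library.

```python
import math

def calibration_slope(y, p, iters=100):
    """Fit the recalibration model logit(P(y=1)) = a + b * logit(p)
    by Newton-Raphson; the fitted b is the calibration slope
    (1 = well calibrated, < 1 = over-prediction of extreme risks)."""
    lp = [math.log(pi / (1 - pi)) for pi in p]   # logit of predicted risks
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(lp, y):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu                        # score vector
            g1 += (yi - mu) * xi
            h00 += w                             # Fisher information
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        a += (h11 * g0 - h01 * g1) / det         # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return b

def mape(y, p):
    """Mean absolute prediction error between predicted risk and the
    0/1 outcome (one common operationalisation)."""
    return sum(abs(pi - yi) for pi, yi in zip(p, y)) / len(y)
```

Perfectly calibrated predictions yield a slope of 1; predictions whose logits are too spread out (over-predicted extremes) yield a slope below 1.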

Table 4. Discrimination, MAPE and calibration slopes of included models in patients with complete
follow-up data (all models and risk scores) and in the full validation cohort (Cox
proportional hazards regression models only)

Fig. 3. Calibration plot of predicted and observed risk for the cohort of patients with complete follow-up. At the bottom, a rug plot in the form of a histogram shows the distribution of the predicted values

Table 5 reports the PPV, sensitivity and specificity for each of the simplified scoring systems. In this analysis we included the full QKidney® [36] model, as it does not have an associated simplified scoring system. We also included the full O'Seaghdha et al. [52] model because we could not implement their scoring system: multiple predictors had 70 % or more missing values in our dataset. For two scoring systems (Chien et al. [51] and Thakkinstian et al. [56]), the best threshold in our population differed from the threshold proposed in the development study. For QKidney® [36] and O'Seaghdha et al. [52], which did not report a threshold in the development study, the optimal threshold in our population was 0.017 (SD, 0.002) and 0.086 (SD, 0.010), respectively. The models showed similar performance, with a PPV, sensitivity and specificity of approximately 0.145, 0.86 and 0.80, respectively.
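A minimal sketch of the threshold-based metrics in Table 5, labelling patients with predicted risk at or above a chosen threshold as high risk (the helper name is ours, not from the paper):

```python
def classification_metrics(y, p, threshold):
    """PPV, sensitivity and specificity when predicted risk >= threshold
    is labelled high risk; y holds 0/1 outcomes, p predicted risks."""
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    fn = sum(1 for yi, pi in zip(y, p) if pi < threshold and yi == 1)
    tn = sum(1 for yi, pi in zip(y, p) if pi < threshold and yi == 0)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return ppv, sensitivity, specificity
```

The low PPV (~0.145) alongside high sensitivity (~0.86) reflects the low prevalence of incident CKD: most flagged patients do not develop the disease even when most cases are caught.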

The distributions of the linear predictors in the development datasets and the validation dataset, calculated as proposed by Debray et al. [44], are shown in Table 6. For all models, the mean of the linear predictor in the validation dataset was lower than in the development datasets: we found mean differences between 0.2 and 0.6, except for the model of Thakkinstian et al. [56], which had a difference of 1.5. There were few differences between the mean linear predictors computed on our dataset using summary statistics compared with individual patient data.

Table 6. Mean linear predictor, calculated in development datasets and in our validation dataset
(patients with complete follow-up data only)
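The small gap between the summary-statistic and individual-patient estimates is expected: for a purely linear predictor, the mean over patients equals the predictor evaluated at the predictor means. A sketch with hypothetical coefficients and data:

```python
def mean_lp_ipd(coefs, rows):
    """Mean linear predictor from individual patient data: average the
    per-patient sums of coefficient * predictor value."""
    return sum(sum(b * x for b, x in zip(coefs, r)) for r in rows) / len(rows)

def mean_lp_summary(coefs, means):
    """Mean linear predictor from summary statistics only: evaluate the
    linear predictor at the predictor means."""
    return sum(b * m for b, m in zip(coefs, means))

# Hypothetical two-predictor model (e.g. age, SBP) on four patients.
coefs = [0.03, 0.02]
rows = [[55, 120], [60, 130], [70, 140], [80, 150]]
means = [sum(c) / len(rows) for c in zip(*rows)]
# The two estimates coincide for purely linear terms; differences only
# arise from nonlinear transformations or interactions of predictors.
```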

The threshold probability associated with the highest tenth of predicted risk varied from 0.0692 for QKidney® [36] to 0.4256 for the model developed by Thakkinstian et al. [56]. When applying these thresholds to select the 10 % of patients with the highest predicted risks, QKidney® [36] identified 64.5 % of all patients who developed CKD during the study period. Proportions for the other models ranged from 48.0 % for the model of Thakkinstian et al. [56] to 64.0 % for the model of O'Seaghdha et al. [52].

Decision curves for the cohort of patients with complete follow-up are presented in Fig. 4. The models by Bang et al. [54] and QKidney® [36] had the best performance. At predicted probability thresholds lower than 0.5, their net benefit was greater than that of all other models and greater than the strategies of labelling all patients as high risk (black line) or none as high risk (grey line). For predicted probability thresholds greater than 0.5, Bang et al. [54] and QKidney® [36] were equivalent to the choice of not labelling any patient as high CKD risk (grey line).

Fig. 4. Decision curve analysis for the cohort of patients with complete follow-up
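The decision curves plot net benefit across threshold probabilities. A minimal sketch of the standard net-benefit formula, NB = TP/n − (FP/n) × pt/(1 − pt), together with the treat-all reference strategy (treat-none has net benefit 0 at every threshold):

```python
def net_benefit(y, p, pt):
    """Net benefit of labelling predicted risk >= pt as high risk:
    true positives per patient minus false positives per patient,
    weighted by the odds of the threshold probability pt."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 0)
    return tp / n - fp / n * pt / (1 - pt)

def net_benefit_treat_all(y, pt):
    """Reference strategy that labels every patient as high risk."""
    prev = sum(y) / len(y)
    return prev - (1 - prev) * pt / (1 - pt)
```

A model's curve lying above both reference strategies over a threshold range, as for Bang et al. and QKidney® below 0.5, indicates the model adds clinical value at those thresholds.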

Sensitivity analyses

The sensitivity analysis conducted on patients with CKD risk factors showed comparable calibration and MAPE (Bang et al. [54] and QKidney® [36] were again the only well-calibrated models), with an overall decrease in discrimination of about 0.1 (Additional file 3: Table S4) compared with our main analysis. Specifically, AUC values for patients with complete follow-up ranged from 0.756 (95 % CI, 0.749–0.762) to 0.801 (95 % CI, 0.795–0.808), while the c-index values for the two Cox regression models were 0.755 (95 % CI, 0.749–0.761) [51] and 0.775 (95 % CI, 0.769–0.781) [36], respectively. The performance of the simplified scoring systems was worse than that of the models from which they were derived.

The sensitivity analysis in which CKD was defined by the presence of only one eGFR measurement lower than 60 mL/min/1.73 m² or a diagnostic code for CKD stages 3–5 led to a higher prevalence of CKD onset (5.2 %, n = 8854), with an overall predictive model performance that slightly decreased (Additional file 3: Table S5), especially in terms of calibration. CKD onset prevalence was also higher (3.9 %, n = 6988) when we calculated eGFR using the CKD-EPI formula, with an increase in absolute numbers of approximately 1000 cases and an average age in this group of 76 years (SD, 8.1). Overall performance was similar to our main analysis, and only the model by Bang et al. [54] was well calibrated in this sensitivity analysis (Additional file 3: Table S8). As expected, we observed an increase in CKD onset prevalence (7.6 %, n = 13,652) when we counted patients who died during follow-up as if they had developed CKD; however, this did not lead to changes in the discriminative performance of the models (Additional file 3: Table S6). Conversely, calibration improved for all models that over-predicted CKD in our main analysis. In the analysis restricted to patients with complete data on all predictors, we found an overall decrease in performance of about 0.08 for AUCs and c-index (Additional file 3: Table S7), while the sensitivity analysis that used a 4-year time horizon showed similar discriminative performance to our main analysis, but worse calibration for all models except QKidney® (Additional file 3: Table S9).
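For reference, the 2009 CKD-EPI creatinine equation used in that sensitivity analysis can be sketched as follows (serum creatinine in mg/dL; the paper does not describe its implementation, so this is a generic one):

```python
def egfr_ckd_epi_2009(scr_mg_dl, age, female, black=False):
    """Estimated GFR (mL/min/1.73 m^2) from the 2009 CKD-EPI
    creatinine equation; scr_mg_dl is serum creatinine in mg/dL."""
    kappa = 0.7 if female else 0.9       # sex-specific creatinine cut-point
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141 * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr

# An eGFR below 60 mL/min/1.73 m^2 corresponds to CKD stage 3 or worse.
```

Compared with MDRD, CKD-EPI yields higher eGFR estimates at near-normal creatinine levels, which is consistent with the roughly 1000 additional, predominantly older, incident cases reported above.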