The extension of total gain (TG) statistic in survival models: properties and applications

Simulation study

We conducted extensive simulation studies to explore the properties of T G(t) and T G STD
(t). Choodari-Oskooei et al.1] described the properties that a ‘good’ measure of predictive ability for a survival
model should possess. They are: i) independence from censoring; ii) monotonicity; iii) robustness against influential (extreme and outlier) observations; and iv) interpretability. Our simulations, therefore, were carried out to explore the performance
of the measures with respect to these criteria.

In this section, we first describe the simulation model. Then, we present the results
of simulations and assess the performance of the measures with respect to the above-mentioned
criteria. We investigate the upper bound of both T G(t) and T G STD
(t), as well as the impact of non-proportional hazards on T G STD
(t). The simulation model (exponential), censoring mechanisms (random censoring), censoring
proportions, covariate distributions (normal, positively skewed, and negatively skewed),
and covariate effects assumed in our studies are explained below.

Simulation of censored time-to-event data

We simulated time-to-event data from the exponential distribution with baseline hazard
rate ?. The survival time in a proportional hazards model with a covariate, Z, was simulated as

(9)

where U is sampled from the standard uniform distribution, U(0,1). To generate randomly censored survival times, we followed guidelines provided
by Burton et al.22].

Design parameters

Covariate distribution and effects: We study the measures in the context of multiple regression where the PI, i.e. the
linear predictor, in the model is generally a function of several variables. As a
result of the central limit theorem 1], the prognostic index should tend to Normality as the dimension of the parameter
vector ? increases. However, skewed prognostic factors are not uncommon in medical research
– for example see the distribution of the number of positive lymph nodes (skewness:
2.8 and Kurtosis: 16.2) and progesterone receptor (skewness: 4.8 and Kurtosis: 37.8)
in the breast cancer data set studied in 1]. Thus, we conducted our simulation study for three covariate distributions: normal
N(0,1); negatively skewed with skewness of ?2.8; and positively skewed with skewness
of 2.8. We applied the method proposed by Fleishman 23] to transform the standard normal distribution to skewed distributions with mean 0
and variance 1. For all covariate distributions, we carried out our simulations under
four covariate effects of exp(?)={1.25,1.5,2,4}.

Censoring mechanisms: we carried out our simulations under both random and type I (or administrative)
censoring with 20 %, 50 %, and 80 % censoring proportions. Since the results were very similar, we only present the results
under the random censoring condition.

Sample size and the number of replicates: sample size was set at 500 individuals, and the number of replicates was 5,000 is
all experimental conditions.

Uncensored data

In this section, we present the results for T G STD
(t) and discuss those for T G(t) in uncensored data. The results of simulation studies to evaluate the means of T G STD
(t) under 3 covariate distributions and 4 covariate effects are presented in Table 1. In the simulations, we considered 6 time points to show the behaviour of the measure
over time. The time points are the 5 th, 10th, 15th, 20th, 25th, and 50th centile
of the exponential distribution used to generate the survival times. They correspond
to 6 time points as T1
=2.57, T2
=5.28, T3
=8.17, T4
=11.20, T5
=14.43, and T6
=34.66.

Table 1. Mean and standard deviation (in brackets) of T G STD
(t) at 6 different time points by the covariate distribution (Cov.) and covariate effect
(e x p(?)) – sample size is 500, and 0 % censoring

Generally, the estimates of the measure are higher in positively skewed covariates
and lower in negatively skewed covariates. In most scenarios, there is a mild increase
with increasing time. We also conducted further simulations beyond the time point
T6
and to the maximal time points (data not shown). The results showed that T G(t) is 0 at time 0. It increases until a certain point (i.e. median of the underlying
time-to-event distribution), and then decreases again towards 0 at the maximal time
points. However, T G STD
(t) does not follow this pattern and its trend over time depends on the size of the
effect (and distribution) of the covariate. In all scenarios, the estimates of T G(t) and T G STD
(t) increase with increasing covariate effects. We also carried out similar simulations
with sample sizes of 200 and 1000 which resulted in similar conclusions. The results
showed that the dispersion of the measures decreases as sample size increases, as
expected.

Furthermore, we carried out similar simulation studies on the time-dependent version
of where R(?), and ?0
in Equation 4 are replaced with the corresponding R(?;t), and ?0
(t) – data not shown. Our results confirmed the underlying theory that and are asymptotically the same. However, since is a non-parametric measure, its sampling distribution has larger variance. For example,
for H R= 4 with one normally distributed covariate the means (standard deviation) of and are 0.40 (SD: 0.02) and 0.40 (SD: 0.04), respectively.

Censoring effect

In this section, we present the results of simulations to study the impact of random
censoring on T G STD
(t). The results are demonstrated in Table 2. The simulations were carried out for three covariate distributions, three censoring
proportions (20 %, 50 %, and 80 %), and one covariate effect of 0.693 (exp(?)=2). We report the average percentage difference between the means of the measures
in the censored data and the corresponding means of the measures in the uncensored
condition, i.e. the percentage bias.

Table 2. The percentage difference in the means of T G STD
(t) in censored data from those of T G STD
(t) in the corresponding uncensored data by covariate distribution (Cov.), and censoring
proportion

The results show that censoring has almost no effect on the estimates. The percentage
difference to the means of the measure in censored scenarios are less than 1 %, except in one case where the censoring proportion is more than 80 %. Even in this scenario, i.e. negatively skewed covariate, the means (standard deviation)
of sampling distribution of T G STD
(T6
) for 0 % and 80 % censoring conditions are 0.280 (SD: 0.022) and 0.285 (SD: 0.062), respectively –
a practically negligible difference in means.

We conducted similar simulations with sample sizes of 200 and 1000 and different covariate
effects which resulted in similar conclusions. However, the results showed that in
general the percentage difference decreases as the sample size increases in all censoring
proportions. As a measure of dispersion, we also calculated the standard deviation
of sampling distribution of the measure (data not shown) in each experimental condition.
The results showed that the dispersion of the measures decreases as sample size increases
in all censoring conditions, as expected. To illustrate this, Fig. 2 presents the sampling distribution of T G STD
(T2
) for different covariate effects, sample sizes, and censoring proportions.

Fig. 2. The sampling distribution of T G STD
(t) by covariate effect ?, sample size N, and censoring proportion. Number of replicates is 5,000 in all scenarios
and the covariate, Z, is normally distributed

Monotonicity and upper bound

The monotonicity property requires that T G STD
(t) should increase with the size of covariate effect, i.e. |?|. In this section, we applied simulations to explore the means of both TG and T G STD
for a range of covariate effects where the distribution of survival time is exponential
and the covariate is normally distributed.

The results are presented in graphs of Fig. 3. The figure shows the means of T G(t), left graph, and T G STD
(t), right graph, at different time points by the covariate effect in the 50 % random censoring condition. As it is evident from the graphs, T G(t) increases with ? and reaches a plateau of about 0.5; whereas T G STD
(t) reaches values close to 1 for large ?s. This accords with the finding of Bura and Gastwirth 9] for a logistic regression model where they showed that the upper bound of TG is less than or equal to 0.5. Finally, it is noticeable that the means of T G(t) at 6 time points differ for small values of ?, but they converge as the covariate effect becomes stronger. The means of T G STD
(t) are generally in agreement throughout the range of the covariate effect.

Fig. 3. The means of T G(t) (a) and T G STD
(t) (b) by the covariate effect in censored data. The covariate is normally distributed,
random censoring (50 %) condition is used with a sample size of 500

Influential observations

In this section, we study the impact of extreme and outlier observations on T G STD
(t) using simulations. We follow the definition of extreme and outlier observations
as outlined in 1]: an extreme observation fits the underlying relationship between survival time and
the covariate but it lies in the extremes of the (covariate) distribution, whereas
the outlier observation does not fit the underlying relationship. In simulations,
we generated survival times from an exponential distribution with one normally distributed
covariate N(0,1) and covariate effect of ?=0.69, with each replicated data set of size 200. Each data set was contaminated with
a single extreme or outlier observation according to the procedure described in 1].

The results are presented through graphs in Fig. 4. The graphs show the means of T G STD
(t) by the value of the outlying covariate observation. For example, 4 in the X axis represents the condition where one observation’s covariate, Z?N(0,1), is replaced with 4, and the corresponding value in the Y axis represents the mean of T G STD
(t). The mean of T G STD
(t) in the uncontaminated data is represented with ‘none’ on the X axis. The left graph in Fig. 4 shows the impact of one extreme covariate observation, and the right graph show the
impact of one outlier covariate observation at 5 different time points. In the graphs,
if a measure is resistant to extreme and outlier observations, its mean would not
change in the presence of such observations. In other words, we expect a horizontal
line across the X axis if the measure is resistant to such observations. The graphs indicate that the
means of T G STD
(t) are robust against the extreme observations, but they decrease rapidly as the outlier
observation becomes more severe in all time points.

Fig. 4. The impact of one extreme or outlier observation on the mean of T G STD
(t). a one extreme observation (left); b one outlier observation (right). The covariate is normally distributed, random censoring
condition, sample size = 200, and 50 %

Non-proportional hazard and time-dependent covariates

We carried out simulations to investigate the performance of T G STD
(t) under non-proportional hazards (non-PH) and time-dependent covariate effect in a
two arm trial setting. We used the Weibull distribution to generate time-to-event
data and obtained the corresponding shape and scale parameters for the distribution
of time-to-event data in each arm from IPASS (Iressa Pan-ASia Study) trial 24] – we used the same method as in 25] to estimate these parameters.

IPASS is a phase 3, two arm trial of previously untreated patients in East Asia who
had advanced pulmonary adenocarcinoma (lung cancer) 24]. The main results from IPASS are summarized in Mok et al.24]’s Figure two, which shows the distribution of time-to-event in each arm as Kaplan–Meier
curves. Their Figure two(A) (i.e. Figure two, Panel A) shows that the progression-free
survival curves cross at approximately 5.7 months, thus showing extreme non-PH. As
it has been shown in 25], Weibull distributions with the following scale and shape parameters provide a good
fit to the (censored) survival times in the two treatment arms: control arm parameters,
scale = 0.35 and shape = 1.72; and experimental arm parameters, scale = 0.10 and shape
= 1.08. We used these parameters to generate the survival times in the two groups.
We truncated the time to event at 20 months to resemble the follow-up pattern in Mok
et al.24]’s Figure two(A).

The fitted survival curves by treatment arm are shown in Fig. 5(a). The curves cross at approximately the median survival time, which is about 5.7
months in each arm. We chose the same sample size as that of the IPASS trial, i.e.
n=1127, with equal allocation ratio. We fitted a flexible parametric survival model
25] on the log cumulative hazard scale – see Additional file 2 for the details on the fitted model. Figures 5(b) and (c) show the fitted log hazard ratio and the cumulative hazard functions for
the two arms. We then evaluated T G STD
(t) from 1?20 months with 5,000 replications in each scenario. The results are summarised
in Fig. 5(d). The mean (standard deviation) of the sampling distribution of T G STD
(t) is 0.25(SD: 0.03) at 1 month, but it decreases towards 0 as the survival curves
of the two arms converge. The mean reaches its minimal value of 0 where the survival
curves cross at about 5.7 months. At this time point there is no separation/discrimination
between the two groups and the mean value of T G STD
(t) reflects this. It then increases as the survival curves diverge again until about
16 months, but it is starting to level off after this time point. Our results also
showed a similar pattern for the mean of for the first 16 months; although the means of are much smaller throughout – see the right y-axis in Fig. 5(d). None of , , and can be used to meaningfully evaluate the performance of the model in this setting.

Fig. 5. a Fitted survival curves by treatment group for the IPASS trial – simulated data. b Log hazard ratio over time. c Cumulative hazard functions by treatment. d The means of sampling distributions of T G STD
(t) and from 1 to 20 months – vertical lines represent the standard deviations in each scenario

Impact of categorisation of covariates

In this section, we study the impact of categorisation of covariates on T G STD
(t). We carried out simulations to explore its performance when the continuous prognostic
factors such as age and weight are categorised. Royston et al.26] explained the dangers of dichotomisation of continuous covariates in the context
of regression modelling, with the conclusion that it is an unnecessary practice for
statistical analysis. They also showed that it will reduce both the amount of prognostic
information and power, resulting in a reduction in the predictive ability of the fitted
model.

In our simulations, the (conditional) distribution of survival times were exponential.
The covariate was normally distributed as N(0,1) with an effect of exp(?)=4. We progressively categorised the covariate into j=2,…,20 categories by its quantiles. In each scenario, we computed percentiles corresponding
to percentages 100?k/j for k=1,2,…,j?1. For example, the categorisation of the covariate into 10 different groups requires
that the 10th, 20th,…, 90th percentiles be computed.

To compare the performance of the measures with those of other proposed measures of
predictive ability, we also carried out similar simulations for , , , and , which have been recommended by Choodari-Oskooei et al.1], 3] for general use. , , and summarise the predictive ability for the entire follow-up period, whereas is time-dependent and changes over the follow-up period. is based on the (modified) Brier score 14]. Both and are (monotonic) functions of the variance of the prognostic index of the model, whereas
is based on the expected likelihood (entropy) under the full and null models – see
1], 3] for their formula and further details.

The percentage difference between the means of the measures in the categorised scenario
and the corresponding means of the measures in the uncategorised condition are displayed
in Fig. 6. In general, a proportion of prognostic information is lost through grouping, which
should be reflected in the estimates of the measures. As Fig. 6 demonstrates the loss of prognostic information is much more pronounced in the estimates
of , , and . Except T G STD
(t), all the other measures monotonically increase as the number of groups increases
and reach values very close to the mean of the measures in the true model for more
than 7 groups. The unexpected fluctuations in the means of T G STD
(t) in less than 5 groups is due to the impact of integration under the predictiveness
curve in these scenarios, which is a step function. In general, the loss of prognostic
information is relatively small in both T G STD
(t) and .

Fig. 6. The impact of categorization on T G STD
(t), , , , and – T G STD
(t) and were evaluated at T6
. The percentage difference to the means of the measures in the uncategorised scenario
plotted on the vertical axis against the number of groups (horizontal axis). In the
uncategorised scenario, i.e. in the true model, the means of T G STD
(t), , , , and were 0.56, 0.54, 0.54, 0.64, and 0.33, respectively

Applications

In this section we present the results of investigation on the application of the
measures to data sets from 5 disease areas. The estimates of the measures (with 95
% bootstrap confidence intervals) are displayed in Table 3. We also compare the estimates of T G STD
(t) with those of , , , and . The data sets are from: i) breast cancer (Schumacher et al., 21]); ii) lymphoma (Rosenwald et al., 27]); iii) primary biliary cirrhosis (PBC) (Fleming and Harrington, 28]); iv) renal cancer (Ritchie et al., 29]); and v) prostate cancer (Byar and Green, 30]). All data sets are in the public domain – see Additional file 2.

Table 3. The estimates of T G STD
(t), , , , and , including 95 % bootstrap confidence intervals from 1000 replicates, in real data
sets at 3 time points. The time points T1
, T2
, and T3
at which T G STD
(t) and are evaluated in all data sets are the 25th, 50th, and 75th quantile of the follow-up
period, i.e. the time to the last event, in each study. Therefore, T1
, T2
, and T3
are different in each study

Multivariable prognostic models based on the Cox PH model have already been developed
for the above data sets. We applied the measures to these models to compare their
performance. The first two data sets have been analysed extensively by Choodari-Oskooei
et al.1], 3]- see Additional file 2 for further details on the data sets, prognostic factors included in each study,
and the summary of fitted models.

Table 3 shows the estimates of T G STD
(t) and at 3 time points, together with the estimates of , , and . The 3 time points at which T G STD
(t) and are evaluated in all data sets are the 25th, 50th, and 75th quantile of the follow-up
period (the time to the last event) in each study. We emphasise that in practice a
clinically motivated time point should be chosen. In all studies, the point estimates
of T G STD
(t) mildly increase with time, and are markedly higher than those for . In some data sets, the estimates are within close range of those for and .

As stated before, the advantage of T G STD
(t) over , , and is its ability to assess how the model’s predictive performance changes over time.
Figure 7 shows the estimates of T G STD
(t) over the entire follow-up in each study, together with the 95 % bootstrap confidence intervals. It can be concluded that the predictive ability of
the models for breast cancer, PBC, and (to some extent) lymphoma studies relatively
remain constant over the follow-up period.

Fig. 7. Estimates of TG
STD
(t) over the entire follow-up period in each study. a breast cancer study; b lymphoma data; c PBC study; d renal cancer data; and e prostate cancer. The solid lines represent the point estimates, and the dashed lines
are the 95 % bootstrap confidence intervals with 1,000 replicates in each experimental
condition