J Thorac Cardiovasc Surg 2006;131:1243-1247
© 2006 The American Association for Thoracic Surgery
General Thoracic Surgery
a Unit of Thoracic Surgery, "Umberto I" Regional Hospital, Ancona, Italy
b Division of Thoracic Surgery, Sheffield Teaching Hospital, Sheffield, United Kingdom.
Received for publication November 29, 2005; revisions received January 3, 2006; accepted for publication February 6, 2006. * Address for reprints: Alessandro Brunelli, MD, Via S. Margherita 23, Ancona 60129, Italy (Email: alexit_2000{at}yahoo.com).
| Abstract |
|---|
METHODS: Eleven mortality models (1 developed by means of logistic regression and bootstrap validation and the other 10 developed by means of the traditional training-and-test random splitting of the dataset) were generated from the data of unit A (571 patients who underwent major lung resection). The performance of each of the 11 mortality models was then evaluated by assessing the distribution of the respective c-statistics in 1000 bootstrap samples derived from unit B (224 patients).
RESULTS: The first model (logistic regression and bootstrap analysis) had good discrimination among the 1000 bootstrap external samples (c-statistics >0.7 in 80% of samples and >0.8 in 38% of samples). Among the 10 training-and-test models, only one model had a similar performance, whereas the others had a poorer discrimination.
CONCLUSIONS: The traditional training-and-test method for risk model building proved to be unreliable across multiple external populations and was generally inferior to bootstrap analysis for variable selection in regression analysis. Therefore the use of bootstrap analysis must be recommended for every future model-building process.
| Introduction |
|---|
Regression analyses are the analytic techniques most commonly used for risk modeling. However, the resultant models are useful only if they reliably predict outcomes for patients by determining significant risk factors associated with the outcome of interest. A problem might arise from this dependence on risk factor analysis: different investigators evaluating the same predictors through regression analysis might obtain heterogeneous results because of methodologic discrepancies and inadvertent biases introduced in the statistical analysis [1].
In the early 1980s, computer-intensive techniques termed bootstrap methods were popularized [2-6]. Bootstrap analysis is a simulation method for statistical inference that, when applied to regression analysis, can identify variables that have a high degree of reproducibility and reliability as independent risk factors for the given outcome.
In fact, the predictive validity of a model can be assessed not only in one randomly split set of patients, as in the traditional training-and-test method, but also in perhaps hundreds or, typically, 1000 new different samples of the same number of patients as the original database obtained by means of resampling with replacement.
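The resampling step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the software used in the study; the function names and the list of integer identifiers standing in for patient records are assumptions made for the example.

```python
import random


def bootstrap_sample(records, rng):
    """Draw len(records) items from records, with replacement."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


def bootstrap_samples(records, n_samples=1000, seed=0):
    """Return n_samples bootstrap samples, each the size of the original data."""
    rng = random.Random(seed)
    return [bootstrap_sample(records, rng) for _ in range(n_samples)]


# Placeholder data: 571 integer identifiers standing in for patient records.
patients = list(range(571))
samples = bootstrap_samples(patients, n_samples=1000, seed=42)
```

Because sampling is done with replacement, some patients appear more than once in a given sample and others not at all, which is what makes each sample a slightly different version of the original database.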
We hypothesized that the traditional training-and-test method for model building might generate models that are heavily biased by the characteristics of the patients who are sampled to derive and test them. The external performance of these types of models could be extremely variable and therefore totally unreliable. On the other hand, by using the entire dataset for model construction and bootstrap analysis for validation and variable selection, a more robust and stable model would be obtained, which can be more reliably applied to external patients.
Therefore the objective of the present study was to compare two approaches applied to the same dataset of patients who underwent major lung resection in a single unit: a mortality model, adjusted for the covariates contributing to the risk of death, that was developed from the entire dataset and validated with the bootstrap procedure, and multiple mortality models developed with the training-and-test method. For this purpose, each model was assessed in 1000 external bootstrap samples derived from another set of patients operated on in another unit during the same period.
| Patients and Methods |
|---|
Two different model-building approaches were used. The first method (model A) consisted of using the entire dataset for model construction. The following variables were initially evaluated for possible association with postoperative mortality: age, body mass index (in kilograms per square meter), type of operation (lobectomy vs pneumonectomy), type of disease (benign vs malignant), neoadjuvant chemotherapy, presence of coronary artery disease (CAD), forced expiratory volume in 1 second (FEV1), carbon monoxide lung diffusion capacity (DLCO), predicted postoperative FEV1 (ppoFEV1; calculated by using the formula),
|
|
|
|
Survivors and nonsurvivors were initially compared by means of univariate analyses performed with the unpaired Student t test or the Mann-Whitney test for numeric variables and the χ² test or the Fisher exact test for categorical variables. Multicollinearity was avoided by entering in the regression analysis only one variable (selected by means of bootstrap analysis) from any set of variables with a correlation coefficient greater than 0.5. Variables with a P value of less than .1 at univariate analysis were used as independent variables in a stepwise logistic regression analysis (dependent variable: mortality). A P value of less than .1 was selected for variable retention in the final regression model. The model was then validated by means of bootstrap analysis. In the bootstrap procedure 1000 samples of 571 patients were drawn with replacement. Stepwise logistic regression analysis was applied to every bootstrap sample. The stability of the final model was assessed by determining how frequently each of its variables occurred in the bootstrap models. Predictors that occurred in more than 50% of the bootstrap models were judged to be reliable and were retained in the final model [8]. Unreliable variables, if present, were removed from the final model.
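The 50% inclusion rule can be illustrated with a short sketch. It assumes a caller-supplied `select_variables` function that returns the predictors kept by whatever model-building procedure is used; in the study that procedure was stepwise logistic regression in Stata, which is not reproduced here. The `toy_selector` below is a deliberately crude stand-in (a difference in group means), and the records are synthetic, so nothing in this example reflects the study data.

```python
import random
from collections import Counter


def inclusion_frequencies(data, select_variables, n_boot=1000, seed=0):
    """Fraction of bootstrap samples in which each variable is selected."""
    rng = random.Random(seed)
    n = len(data)
    counts = Counter()
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        counts.update(set(select_variables(sample)))
    return {name: counts[name] / n_boot for name in counts}


def reliable_variables(frequencies, threshold=0.5):
    """Variables selected in more than `threshold` of the bootstrap samples."""
    return sorted(name for name, freq in frequencies.items() if freq > threshold)


def toy_selector(sample, candidates=("age", "noise"), min_gap=5.0):
    """Stand-in for stepwise logistic regression (NOT the article's method):
    keep a variable if its mean differs by more than min_gap between groups."""
    dead = [r for r in sample if r["died"]]
    alive = [r for r in sample if not r["died"]]
    if not dead or not alive:
        return set()
    kept = set()
    for name in candidates:
        gap = (sum(r[name] for r in dead) / len(dead)
               - sum(r[name] for r in alive) / len(alive))
        if abs(gap) > min_gap:
            kept.add(name)
    return kept


# Synthetic records for illustration only; these are not study data.
toy_data = [
    {"died": i % 10 == 0, "age": 75 if i % 10 == 0 else 60, "noise": (i * 37) % 11}
    for i in range(200)
]
frequencies = inclusion_frequencies(toy_data, toy_selector, n_boot=200, seed=1)
reliable = reliable_variables(frequencies)
```

Passing the selection procedure in as a function keeps the resampling logic separate from the regression itself, so the same loop could wrap any variable-selection routine.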
The second method (model B) consisted of the traditional training-and-test splitting method. The dataset was randomly split into 2 sets of patients. The first set (60% of the database) was used to develop the model. The same variables used in the first method were initially evaluated for possible association with postoperative mortality. Screening for univariate associations and multicollinearity was performed as described for the first method. Variables with a P value of less than .1 at univariate analysis were used as independent variables in a stepwise logistic regression analysis (dependent variable: mortality). A P value of less than .1 was selected for variable retention in the final regression model, whose calibration and discrimination were assessed in the remaining 40% of patients (test set) by using the Hosmer-Lemeshow goodness-of-fit statistic and the c-statistic (area under the receiver operating characteristic curve), respectively [9-12].
The proportions of patients assigned to the training and test samples (60% and 40%, respectively) were selected in accordance with recently published analyses on risk modeling in lung surgery [13].
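A minimal sketch of the two mechanical pieces of this approach, the random 60/40 split and the c-statistic, is given below. The function names are hypothetical, the c-statistic is computed by direct pairwise comparison (ties counted as one half), and the model-fitting step is omitted; it is an illustration under these assumptions rather than the procedure run in Stata.

```python
import random


def split_train_test(records, train_fraction=0.6, seed=0):
    """Randomly split records into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def c_statistic(scores, outcomes):
    """Probability that a randomly chosen patient who died (outcome 1) has a
    higher predicted risk than a randomly chosen survivor (outcome 0)."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for s in non_events:
            if e > s:
                concordant += 1.0
            elif e == s:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Placeholder identifiers standing in for the 571 patients of the derivation unit.
train, test = split_train_test(list(range(571)), train_fraction=0.6, seed=1)
```

A c-statistic of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of deaths from survivors.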
Ten models were developed by repeating the training-and-test method 10 times. Therefore a total of 11 mortality models were obtained. The performance of each of these models was assessed 1000 times, each time in a bootstrap sample of 224 patients drawn with replacement from the database of unit B, by evaluating the distribution of the c-statistics across these samples [9-12].
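The external assessment can be sketched as follows: the predicted risks of a fixed model are resampled together with the observed outcomes, and the c-statistic is recomputed in each resample. The names are hypothetical, the scores and outcomes are synthetic, and skipping resamples that contain no deaths or no survivors is an assumption of this sketch, since the article does not state how such samples were handled.

```python
import random


def c_statistic(scores, outcomes):
    """Pairwise c-statistic; ties between an event and a non-event count 0.5."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    total = sum(1.0 if e > s else 0.5 if e == s else 0.0
                for e in events for s in non_events)
    return total / (len(events) * len(non_events))


def bootstrap_c_statistics(scores, outcomes, n_boot=1000, seed=0):
    """c-statistic of fixed predictions in n_boot resamples of the cohort."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        # Assumption: resamples lacking either outcome are redrawn.
        if 0 < sum(ys) < n:
            values.append(c_statistic([scores[i] for i in idx], ys))
    return values


def share_above(values, cutoff):
    """Proportion of values strictly greater than cutoff."""
    return sum(v > cutoff for v in values) / len(values)


# Synthetic external cohort of 224 patients; not the data of unit B.
outcomes = [1 if i < 12 else 0 for i in range(224)]
scores = [0.75 if y == 1 else (i % 10) / 10 for i, y in enumerate(outcomes)]
distribution = bootstrap_c_statistics(scores, outcomes, n_boot=1000, seed=3)
above_07 = share_above(distribution, 0.7)
above_08 = share_above(distribution, 0.8)
```

Summaries such as `above_07` and `above_08` correspond to the kind of figures reported for each model, namely the percentage of external samples in which the c-statistic exceeded 0.7 or 0.8.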
Prospective, electronic, quality-controlled, clinical databases at the 2 participating centers were used for the analysis of data. The study was approved by the local institutional review boards, and informed consent concerning prospective data collection was obtained from all patients. The authors had access to the primary data, directed the analyses, and made all decisions pertaining to the article and its submission for publication.
All tests were 2-tailed and were entirely performed with the Stata 8.2 statistical software (Stata Corp, College Station, Tex).
| Results |
|---|
|
|
|
(Hosmer-Lemeshow statistics, 7.2; P = .5; c-statistic, 0.76). The expression InR/1 + InR represents the probability of dying because in the logistic regression equation the logarithm of the odds of the outcome (termed the logit or log odds) is used as the dependent variable.
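For reference, the general relation between the log odds and the predicted probability in a logistic regression model can be written as below. The symbols $p$ (probability of death), $z$ (linear predictor), $\beta_j$ (coefficients), and $x_j$ (predictors) are generic notation chosen for this illustration, not the article's own, and the fitted coefficients of the model are not reproduced.

```latex
\ln\!\left(\frac{p}{1-p}\right) = z = \beta_0 + \sum_{j} \beta_j x_j
\qquad\Longleftrightarrow\qquad
p = \frac{e^{z}}{1 + e^{z}}
```

The left-hand form is what the regression estimates; the right-hand form converts a patient's linear predictor back into a probability between 0 and 1.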
The second statistical method (training and test) repeated 10 times yielded the following different mortality models:
The distribution of the different predictors in the 11 mortality models is shown in Table 2.
|
|
|
| Discussion |
|---|
We hypothesized that the traditional training-and-test method for model building, consisting of a random splitting of the database into a derivation set from which to construct the model and a test set in which to assess its calibration and discrimination, might be subject to sampling noise. To test this, we repeated 10 training-and-test sessions, producing 10 corresponding mortality models. Seventy percent of these models included different combinations of variables. The performance of each of these models was assessed 1000 times, each time in a bootstrap sample of 224 patients drawn from the dataset of another unit. The distribution of the c-statistics was extremely variable from one model to another, and in general, their performance in external samples was only modest. The development of risk-adjusted models by the method of training and testing therefore appears completely unreliable.
Bootstrap analysis was recently proposed as a breakthrough method for internal validation of surgical regression models [8,14]. The main advantage of this technique is that the entire dataset can be used for model building, which would yield more robust models, especially in moderate-size databases and for rare outcomes (eg, mortality after major lung resection) [15].
Furthermore, the predictive validity of the model can be assessed not only in one randomly split set of patients but also, typically, in 1000 new different samples of the same number of patients as the original database obtained by means of resampling with replacement. By using this method, we constructed and validated a mortality model that, when assessed in 1000 bootstrap samples drawn from another unit, performed better than the majority of the models developed by using the training-and-test method. This shows that the bootstrap procedure can yield stable models across multiple populations, warranting its use as a standard instrument in future model-building analyses. Yet a PubMed search covering the last 5 years yielded, at the time of this writing, only 16 surgical articles (published in the English literature and dealing with human subjects) that used logistic regression analysis and bootstrap for its validation (0.003% of the total number of surgical articles that used logistic regression analysis and were published during the same period). It is clear that although bootstrap technology has broken down important barriers to surgical clinical research [8], its importance still appears to be largely underestimated by most surgeons. This might be due to the paucity of readily available high-quality statistical software incorporating this analysis or to a lack of understanding of the methodology, which might make surgeons perceive the statistical technique itself as a barrier to their interpretation of clinical data analysis reports. In this regard, specific statistical training focused on the reliable evaluation of surgical outcomes would help to disseminate a culture of quality improvement among surgeons.
On the basis of our results, we regard the process of developing risk models or risk factors without bootstrap validation as unreliable, obsolete, and resembling more an art than a science [8]. In view of the unreliability of the training-and-test method, previously published reports using it should be interpreted with caution.
Bootstrap analysis can formalize the development of model building, removing much of the human bias associated with regression analysis, providing a balance between selecting risk factors that are not reliable (type I error) and overlooking variables that are reliable (type II error), and introducing a concrete measure of reliability of the risk factors [8]. For this reason, the use of bootstrap analysis must be recommended for every future surgical model-building process.