Mind the Performance Gap: Examining Dataset Shift During Prospective Validation

Once integrated into clinical care, patient risk stratification models may perform worse than they did in retrospective validation. It is widely accepted that performance degrades over time due to changes in care processes and patient populations. However, the extent of this degradation is poorly understood, in part because few researchers report prospective validation performance. In this study, we compare the 2020-2021 ('20-'21) prospective performance of a patient risk stratification model for predicting healthcare-associated infections to a 2019-2020 ('19-'20) retrospective validation of the same model. We define the difference between retrospective and prospective performance as the performance gap. We estimate how much i) "temporal shift", i.e., changes in clinical workflows and patient populations, and ii) "infrastructure shift", i.e., changes in how data are accessed, extracted, and transformed, each contribute to the performance gap. Applied prospectively to 26,864 hospital encounters during a twelve-month period from July 2020 to June 2021, the model achieved an area under the receiver operating characteristic curve (AUROC) of 0.767 (95% confidence interval (CI): 0.737, 0.801) and a Brier score of 0.189 (95% CI: 0.186, 0.191). Prospective performance decreased slightly compared to '19-'20 retrospective performance, in which the model achieved an AUROC of 0.778 (95% CI: 0.744, 0.815) and a Brier score of 0.163 (95% CI: 0.161, 0.165). The resulting performance gap was primarily due to infrastructure shift and not temporal shift. So long as we continue to develop and validate models using data stored in large research data warehouses, we must consider differences in how and when data are accessed, measure how these differences may affect prospective performance, and work to mitigate those differences.
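To make the quantities in the abstract concrete, the following is a minimal sketch of how AUROC, the Brier score, a 95% confidence interval, and a performance gap could be computed from labels and predicted risks. It is an illustration under stated assumptions, not the authors' pipeline: the percentile bootstrap for the CI, the sign convention (retrospective minus prospective), the function names, and the toy data are all hypothetical choices made here.

```python
import random

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def bootstrap_ci(metric, y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed method, not taken from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats, attempts = [], 0
    while len(stats) < n_boot:
        attempts += 1
        if attempts > 10 * n_boot:
            raise RuntimeError("too many resamples were missing a class")
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([y_true[i] for i in idx],
                                [y_score[i] for i in idx]))
        except ValueError:
            continue  # resample contained a single class; draw again
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def performance_gap(retro, prosp, metric):
    """Retrospective minus prospective value of `metric` (assumed convention)."""
    return metric(*retro) - metric(*prosp)

# Toy data (hypothetical): (labels, predicted risks) for each validation cohort.
retro = ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
prosp = ([0, 0, 1, 1], [0.2, 0.6, 0.3, 0.5])

print(auroc(*retro))                          # 0.75
print(performance_gap(retro, prosp, auroc))   # 0.25
print(performance_gap(retro, prosp, brier_score))
print(bootstrap_ci(auroc, *retro, n_boot=200))
```

Under this sign convention a positive AUROC gap and a negative Brier gap both indicate worse prospective performance, since higher AUROC is better while lower Brier score is better. Attributing the gap to temporal versus infrastructure shift would additionally require scoring the same prospective encounters on both the retrospective and the prospective data pipelines, which this sketch does not attempt.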
