Performance monitoring of machine learning (ML)-based risk prediction models in healthcare is complicated by the issue of confounding medical interventions (CMI): when an algorithm predicts a patient to be at high risk for an adverse event, clinicians are more likely to administer prophylactic treatment, thereby altering the very outcome that the algorithm aims to predict. A simple approach is to ignore CMI and monitor only the untreated patients, whose outcomes remain unaltered. In general, however, ignoring CMI may inflate Type I error because (i) untreated patients disproportionately represent those with low predicted risk and (ii) evolution in both the model and clinician trust in the model can induce complex dependencies that violate standard assumptions. Nevertheless, we show that valid inference is still possible if one monitors conditional performance and either conditional exchangeability or time-constant selection bias holds. Specifically, we develop a new score-based cumulative sum (CUSUM) monitoring procedure with dynamic control limits. Through simulations, we demonstrate the benefits of combining model updating with monitoring and investigate how over-trust in a prediction model may delay detection of performance deterioration. Finally, we illustrate how these monitoring methods can be used to detect calibration decay of an ML-based risk calculator for postoperative nausea and vomiting during the COVID-19 pandemic.
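To make the core monitoring idea concrete, the following is a minimal sketch of a score-based CUSUM for detecting calibration decay of a Bernoulli risk model. It is not the paper's procedure: the paper uses dynamic control limits and adjusts for CMI, whereas this sketch uses a fixed control limit `h` and a hypothetical alternative defined by a log-odds shift `delta`; the function names and parameters are illustrative assumptions.

```python
import numpy as np

def bernoulli_score(y, p0, delta=0.5):
    """Log-likelihood ratio for one binary outcome y with predicted risk p0,
    comparing the calibrated model (p0) against a hypothetical alternative
    whose log-odds are shifted upward by delta (a miscalibration alternative)."""
    logit1 = np.log(p0 / (1.0 - p0)) + delta
    p1 = 1.0 / (1.0 + np.exp(-logit1))
    return y * np.log(p1 / p0) + (1 - y) * np.log((1.0 - p1) / (1.0 - p0))

def cusum_monitor(ys, ps, h=4.0, delta=0.5):
    """Score-based CUSUM: accumulate positive evidence against calibration,
    resetting at zero, and raise an alarm when the statistic exceeds the
    (here fixed, for simplicity) control limit h."""
    s, path = 0.0, []
    for y, p in zip(ys, ps):
        s = max(0.0, s + bernoulli_score(y, p, delta))
        path.append(s)
        if s > h:
            return path, True  # alarm: calibration decay detected
    return path, False
```

Under the calibrated model the expected score is negative, so the statistic hovers near zero; when observed event rates exceed predictions, the positive scores accumulate and cross the limit. The paper's dynamic control limits would replace the fixed `h` with time-varying thresholds calibrated to the monitoring horizon.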