Deep learning (DL) can aid doctors in detecting worsening patient states early, affording them time to react and prevent adverse outcomes. While DL-based early warning models usually perform well at the hospitals whose data they were trained on, they tend to be less reliable when applied at new hospitals, which makes it difficult to deploy them at scale. Using carefully harmonised intensive care data from four data sources across Europe and the US (totalling 334,812 stays), we systematically assessed the reliability of DL models for three common adverse events: death, acute kidney injury (AKI), and sepsis. We tested whether training on more than one data source and/or explicitly optimising for generalisability during training improves model performance at new hospitals. We found that models achieved high AUROC at the training hospital for mortality (0.838-0.869), AKI (0.823-0.866), and sepsis (0.749-0.824). As expected, performance dropped at new hospitals, sometimes by as much as 0.200 AUROC. Training on more than one data source mitigated the performance drop, with multi-source models performing roughly on par with the best single-source model. This suggests that as data from more hospitals become available for training, model robustness is likely to increase, with robustness effectively lower-bounded by the performance of the most applicable single data source in the training data. Dedicated methods for promoting generalisability did not noticeably improve performance in our experiments.