Recent advances in training deep learning models have demonstrated the potential to provide accurate chest X-ray interpretation and increase access to radiology expertise. However, poor generalization due to data distribution shifts in clinical settings is a key barrier to implementation. In this study, we measured the diagnostic performance of eight different chest X-ray models when applied, without any fine-tuning, to (1) smartphone photos of chest X-rays and (2) external datasets. All models had been developed by different groups and submitted to the CheXpert challenge, and were re-applied to the test datasets without further tuning. We found that (1) on photos of chest X-rays, all eight models experienced a statistically significant drop in task performance, but only three performed significantly worse than radiologists on average, and (2) on the external datasets, none of the models performed statistically significantly worse than radiologists, and five performed statistically significantly better. Our results demonstrate that some chest X-ray models, under clinically relevant distribution shifts, were comparable to radiologists while other models were not. Future work should investigate aspects of model training procedures and dataset collection that influence generalization in the presence of data distribution shifts.
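The kind of comparison described above (applying a trained model to a shifted test set with no further tuning, then testing whether its performance differs significantly from a reference) is commonly done with bootstrap confidence intervals on AUROC. The sketch below is illustrative only and is not the authors' evaluation code: the file paths, the loaded model outputs, and the `reference_auc` value standing in for radiologist performance are all hypothetical placeholders.

```python
# Minimal sketch: bootstrap 95% CI for the difference between a model's AUROC
# on an external test set and a fixed reference score (e.g., a radiologist
# operating point). Paths and the reference value are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, y_prob, reference_auc, n_boot=1000, seed=0):
    """Return a 95% bootstrap CI for (model AUROC - reference AUROC)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:      # skip resamples with one class only
            continue
        diffs.append(roc_auc_score(y_true[idx], y_prob[idx]) - reference_auc)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi

# Hypothetical usage: predictions from a model applied to an external dataset
# without fine-tuning, compared against an assumed reference AUROC of 0.85.
y_true = np.load("external_labels.npy")          # placeholder path
y_prob = np.load("external_model_probs.npy")     # placeholder path
lo, hi = bootstrap_auc_diff(y_true, y_prob, reference_auc=0.85)
print(f"95% CI for AUROC difference vs. reference: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes 0, the difference is statistically significant
# at roughly the 5% level under this (simplified) bootstrap scheme.
```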