A major obstacle to the integration of deep learning models for chest x-ray interpretation into clinical settings is the lack of understanding of their failure modes. In this work, we first investigate whether there are patient subgroups that chest x-ray models are likely to misclassify. We find that patient age and the radiographic findings of lung lesion or pneumothorax are statistically relevant features for predicting misclassification for some chest x-ray models. Second, we develop misclassification predictors for chest x-ray models using their outputs and clinical features. We find that our best-performing misclassification identifier achieves an AUROC close to 0.9 for most diseases. Third, employing our misclassification identifiers, we develop a corrective algorithm that selectively flips model predictions that have a high likelihood of misclassification at inference time. We observe F1 improvements on the prediction of Consolidation (0.008 [95% CI 0.005, 0.010]) and Edema (0.003 [95% CI 0.001, 0.006]). By carrying out our investigation on ten distinct and high-performing chest x-ray models, we are able to derive insights across model architectures and offer a generalizable framework applicable to other medical imaging tasks.
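The corrective step described in the abstract (flipping predictions that a misclassification identifier flags as likely wrong) can be illustrated with a minimal sketch. The abstract does not specify the flipping rule, so the fixed threshold, the function names `flip_likely_errors` and `f1_score`, and the toy data below are all assumptions made for illustration, not the authors' implementation.

```python
def flip_likely_errors(predictions, error_probs, threshold=0.9):
    """Flip each binary prediction whose estimated misclassification
    probability is at or above `threshold` (hypothetical rule)."""
    if len(predictions) != len(error_probs):
        raise ValueError("predictions and error_probs must have equal length")
    return [1 - p if e >= threshold else p
            for p, e in zip(predictions, error_probs)]


def f1_score(labels, predictions):
    """Plain F1 for binary labels: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): ground truth, base-model predictions, and the
# misclassification identifier's estimated probability of an error.
labels = [1, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1]
error_probs = [0.05, 0.95, 0.40, 0.10, 0.20]

corrected = flip_likely_errors(predictions, error_probs, threshold=0.9)
print("F1 before:", f1_score(labels, predictions))
print("F1 after: ", f1_score(labels, corrected))
```

In this toy case only the second prediction exceeds the threshold and is flipped. A high threshold keeps the correction conservative, since flipping a prediction that was in fact correct lowers F1.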