Failure detection in automated image classification is a critical safeguard for clinical deployment: detected failure cases can be referred for human assessment, ensuring patient safety in computer-aided clinical decision making. Despite its paramount importance, there is insufficient evidence on the ability of state-of-the-art confidence scoring methods to detect test-time failures of classification models in medical imaging. This paper provides a reality check on in-domain misclassification detection, benchmarking 9 widely used confidence scores on 6 medical imaging datasets spanning different imaging modalities, in both multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from solved. We find that none of the benchmarked advanced methods from the computer vision and machine learning literature consistently outperforms a simple softmax baseline, demonstrating that improved out-of-distribution detection or model calibration does not necessarily translate into improved in-domain misclassification detection. Our testbed facilitates future work in this important area.
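To make the softmax baseline concrete: it scores each prediction by the maximum softmax probability (MSP), and misclassification detection is then evaluated by how well this score separates correct from incorrect test-set predictions, commonly via AUROC. The following is a minimal sketch of that idea, not the paper's exact protocol; the random logits and labels are hypothetical stand-ins for a trained model's test-set outputs.

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import roc_auc_score

def msp_confidence(logits):
    """Maximum softmax probability (MSP): the simple baseline confidence score."""
    probs = softmax(logits, axis=1)
    return probs.max(axis=1)

def misclassification_detection_auroc(logits, labels):
    """AUROC of the confidence score as a detector of correct predictions.

    Labels are 1 for correct and 0 for misclassified, so a high AUROC means
    the score ranks failures below successes (equivalently, 1 - confidence
    flags failures for referral to a human).
    """
    preds = logits.argmax(axis=1)
    correct = (preds == labels).astype(int)
    return roc_auc_score(correct, msp_confidence(logits))

# Hypothetical usage: random data standing in for real test-set model outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 5))
labels = rng.integers(0, 5, size=1000)
print(misclassification_detection_auroc(logits, labels))
```

Any of the benchmarked confidence scores can be evaluated in this way by swapping `msp_confidence` for another scoring function over the same model outputs.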