The availability of large public datasets and increased computing power have shifted the interest of the medical community toward high-performance algorithms. However, little attention is paid to the quality of the data and their annotations. High performance on benchmark datasets may be reported without considering possible shortcuts or artifacts in the data, and models are not tested on subpopulations. With this work, we aim to raise awareness of shortcut problems. We validate previous findings and present a case study on chest X-rays using two publicly available datasets. We share annotations for a subset of pneumothorax images with drains. We conclude with general recommendations for medical image classification.