Vulnerability to adversarial attacks is a well-known weakness of deep neural networks. While most studies focus on natural images with standardized benchmarks such as ImageNet and CIFAR, little research has considered real-world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, the robustness of chest X-ray classification is much harder to evaluate and leads to very different assessments depending on the dataset, the architecture, and the robustness metric. We argue that previous studies did not take into account the peculiarities of medical diagnosis, such as the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks, and the risk implications of each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and propose new methodological considerations for evaluating the robustness of chest X-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of the robustness of chest X-ray classification models.