Explainable machine learning has become increasingly prevalent, especially in healthcare, where explainable models are vital for ethical and trustworthy automated decision making. Work on the susceptibility of deep learning models to adversarial attacks has shown how easily samples can be crafted to mislead a model into making incorrect predictions. In this work, we propose a model-agnostic, explainability-based method for the accurate detection of adversarial samples on two datasets with different complexity and properties: Electronic Health Record (EHR) and chest X-ray (CXR) data. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%, improving on the state of the art in adversarial detection by over 10% in all settings on both datasets. Our method, based on anomaly detection over explainability outputs, generalises to different attack methods without the need for retraining.