The rising use of deep neural networks for decision making in critical applications such as medical diagnosis and financial analysis has raised concerns regarding their reliability and trustworthiness. As automated systems become mainstream, it is important that their decisions be transparent, reliable, and understandable by humans to foster trust and confidence. To this end, concept-based models such as Concept Bottleneck Models (CBMs) and Self-Explaining Neural Networks (SENN) have been proposed, which constrain the latent space of a model to represent high-level concepts easily understood by domain experts. Although concept-based models promise to increase both explainability and reliability, it remains to be shown whether they are robust and output consistent concepts under systematic perturbations of their inputs. To better understand the performance of concept-based models on curated malicious samples, in this paper we study their robustness to adversarial perturbations: imperceptible changes to the input data crafted by an attacker to fool a well-trained concept-based model. Specifically, we first propose and analyze different malicious attacks to evaluate the security vulnerability of concept-based models. We then propose a general adversarial-training-based defense mechanism to increase the robustness of these systems against the proposed attacks. Extensive experiments on one synthetic and two real-world datasets demonstrate the effectiveness of the proposed attacks and the defense approach.
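To make the notion of an adversarial perturbation concrete, the sketch below applies the classic fast gradient sign method (FGSM) to a toy logistic model. This is purely illustrative and is not the attack proposed in the paper: the model (a hand-specified linear classifier), the function names, and all parameter values are hypothetical, and the gradient is derived in closed form for binary cross-entropy loss.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """FGSM sketch on a toy logistic model p = sigmoid(w . x + b).

    Hypothetical setup: for binary cross-entropy loss, the gradient of the
    loss with respect to input x is (p - y) * w. FGSM shifts each input
    feature by eps in the sign of that gradient, which (to first order)
    maximally increases the loss under an L-infinity budget of eps.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    grad = [(p - y) * wi for wi in w]                      # dLoss/dx
    sign = lambda g: (g > 0) - (g < 0)                      # -1, 0, or +1
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]  # adversarial x
```

For example, with weights `w = [1.0, -1.0]`, bias 0, input `x = [0.5, 0.5]`, and true label `y = 1`, the perturbation pushes the first feature down and the second up, lowering the model's confidence in the true class even though `eps` is small.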