Reliable deployment of machine learning models such as neural networks remains challenging due to several limitations. Among the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this paper, we explore the possibilities and limits of adversarial attacks on explainable machine learning models. First, we extend the notion of adversarial examples to fit explainable machine learning scenarios, in which the inputs, the output classifications, and the explanations of the model's decisions are assessed by humans. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms. In particular, our framework considers a wide range of relevant (yet often ignored) factors, such as the type of problem, the user's expertise, or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions are intended to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.