Explanation methods have emerged as an important tool for highlighting the features responsible for the predictions of neural networks. There is mounting evidence that many explanation methods are rather unreliable and susceptible to malicious manipulation. In this paper, we aim to understand the robustness of explanation methods specifically in the context of the text modality. We provide initial insights and results towards devising a successful adversarial attack against text explanations. To our knowledge, this is the first attempt to evaluate the adversarial robustness of an explanation method. Our experiments show that the explanation method can be largely disturbed for up to 86% of the tested samples with only small changes to the input sentence and its semantics.