Interpretability methods such as Integrated Gradients and LIME are popular choices for explaining natural language model predictions through relative word-importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stakes areas such as medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations to an input text. Via a small fraction of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (and therefore expected to share a similar interpretation). At the same time, the generated examples receive the same prediction label as the seed yet are given substantially different explanations by the interpretation methods. Our experiments use such fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer models and two different NLP datasets. We observe that the rank-order correlation between explanations drops by over 20% when, on average, fewer than 10% of words are perturbed, and it continues to decrease as more words are perturbed. Furthermore, we demonstrate that the candidates generated by our method score well on quality metrics.
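To make the evaluation concrete, below is a minimal sketch (not the paper's code) of how the fragility of an interpretation can be quantified: the word-importance scores of the seed input and of a label-preserving perturbed input are compared with Spearman rank-order correlation, the metric the abstract refers to. The attribution scores shown are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: measuring interpretation fragility via rank-order correlation.
# The score vectors below are hypothetical; in practice they would come from an
# attribution method such as Integrated Gradients or LIME applied to the seed
# text and to its adversarially perturbed counterpart.
from scipy.stats import spearmanr

# Hypothetical per-word importance scores for the same five token positions.
seed_scores      = [0.42, 0.05, 0.31, 0.10, 0.12]   # explanation of the seed input
perturbed_scores = [0.08, 0.40, 0.11, 0.30, 0.11]   # explanation after word swaps

rho, _ = spearmanr(seed_scores, perturbed_scores)
print(f"rank-order correlation: {rho:.2f}")  # a low rho indicates a fragile interpretation
```

A successful attack in this setting keeps the model's predicted label unchanged while driving this correlation down, which is exactly the drop of over 20% reported above.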