State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate TEF's performance in estimating attribution robustness on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between the attributions of unchanged and perturbed inputs, showing that all tested models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations remain effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that computes fast, computationally light perturbations with no knowledge of either the attacked classifier or the explanation method. Overall, our work shows that explanations in text classifiers are very fragile, and users need to carefully address their robustness before relying on them in critical applications.
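To make the attack objective concrete, below is a minimal sketch of the two conditions a TEF-style perturbation must satisfy: the predicted class stays fixed while the attributions decorrelate. The helpers `model_predict` and `attribute` are hypothetical stand-ins for a classifier's forward pass and an explanation method, not part of any specific library, and Spearman rank correlation is used here as one plausible choice of correlation measure.

```python
import numpy as np
from scipy.stats import spearmanr


def attack_succeeds(model_predict, attribute, x_orig, x_pert):
    """Check a TEF-style perturbation against its two conditions.

    `model_predict` and `attribute` are hypothetical stand-ins for a
    text classifier and an explanation method (e.g. saliency or
    Integrated Gradients); `x_orig` and `x_pert` are token sequences
    of equal length (TEF replaces words in place).
    """
    # Condition 1: the classifier's prediction must not change.
    same_prediction = model_predict(x_orig) == model_predict(x_pert)

    # Condition 2: the token-level attributions should diverge.
    # The evaluation compares the attribution vectors of the
    # unchanged and the perturbed input via a correlation measure;
    # lower correlation means a more successful explanation attack.
    a_orig = np.asarray(attribute(x_orig), dtype=float)
    a_pert = np.asarray(attribute(x_pert), dtype=float)
    rho, _ = spearmanr(a_orig, a_pert)

    return same_prediction, rho
```

A black-box attack in this setting would search over imperceptible word substitutions to minimize `rho` subject to `same_prediction` remaining true.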