Explanations are crucial parts of deep neural network (DNN) classifiers. In high-stakes applications, faithful and robust explanations are important to understand and gain trust in DNN classifiers. However, recent work has shown that state-of-the-art attribution methods in text classifiers are susceptible to imperceptible adversarial perturbations that alter explanations significantly while maintaining the correct prediction outcome. If undetected, this can critically mislead the users of DNNs. Thus, it is crucial to understand the influence of such adversarial perturbations on the networks' explanations and their perceptibility. In this work, we establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity. Crucially, it reflects both the attribution change induced by adversarial input alterations and the perceptibility of such alterations. Moreover, we introduce a wide set of text similarity measures to effectively capture locality between two text samples and imperceptibility of adversarial perturbations in text. We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimate of attribution robustness in text classification. TEA uses state-of-the-art language models to extract word substitutions that result in fluent, contextual adversarial samples. Finally, with experiments on several text classification architectures, we show that TEA consistently outperforms current state-of-the-art AR estimators, yielding perturbations that alter explanations to a greater extent while being more fluent and less perceptible.
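To make the Lipschitz-style notion of attribution robustness concrete, the following is a minimal sketch, not the paper's implementation. It assumes the robustness estimate is the largest ratio of attribution change to input change over candidate perturbations that keep the prediction unchanged. The use of cosine distance between attribution vectors, and the names `cosine_distance`, `lipschitz_ratio`, and `estimate_attribution_robustness`, are illustrative assumptions; the abstract does not specify the exact distance measures.

```python
import math
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity (assumed attribution distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero attribution vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def lipschitz_ratio(attr_x: Vector, attr_adv: Vector, input_distance: float) -> float:
    """Attribution change divided by input change for one perturbed sample."""
    if input_distance <= 0.0:
        raise ValueError("input_distance must be positive")
    return cosine_distance(attr_x, attr_adv) / input_distance


def estimate_attribution_robustness(
    attr_x: Vector,
    candidates: Iterable[Tuple[Vector, float, bool]],
) -> float:
    """Largest ratio over candidates that preserve the predicted class.

    Each candidate is (attributions of perturbed text,
    text distance to the original, prediction unchanged?).
    """
    ratios = [
        lipschitz_ratio(attr_x, attr_adv, dist)
        for attr_adv, dist, same_prediction in candidates
        if same_prediction
    ]
    return max(ratios, default=0.0)
```

Under this reading, a stronger adversary finds perturbations with a larger ratio, that is, a bigger attribution change for a smaller (less perceptible) change in the text, which is why a stronger attack yields a tighter estimate.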