Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks on natural language tasks have boomed over the last five years, using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, such as adversarial training, there is to our knowledge no effective general reactive approach to defence via detection of textual adversarial examples, of the kind found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap, which, unlike the few limited-application baselines from NLP, are based entirely on distributional characteristics of learned representations: we adapt one method from the image processing literature, Local Intrinsic Dimensionality (LID), and propose a novel one, the MultiDistance Representation Ensemble Method (MDRE). The adapted LID method and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset, and on the latter two attack types on the MultiNLI dataset. To facilitate future research, we publish our code.
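The abstract does not spell out the estimator, so the following minimal sketch is offered only to make the LID-based detection idea concrete. It assumes the standard maximum-likelihood LID estimator commonly used for adversarial detection in the image processing literature, applied here to a text classifier's learned representations; the function name lid_mle, the parameter k, and the array shapes are illustrative assumptions, not the paper's implementation.

import numpy as np

def lid_mle(queries, reference, k=20):
    """Maximum-likelihood LID estimate for each row of `queries`,
    computed from its k nearest neighbours in `reference`.

    queries:   (n, d) array of learned representations (e.g. the hidden
               states a text classifier assigns to n inputs).
    reference: (m, d) array of representations of clean examples from the
               same layer; assumed not to contain the query rows themselves.
    Returns an (n,) array of LID estimates; adversarial inputs tend to
    receive higher estimates than clean ones.
    """
    # Euclidean distance from every query to every reference point.
    dists = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    # Distances to the k nearest neighbours, sorted r_1 <= ... <= r_k.
    knn = np.sort(dists, axis=1)[:, :k]
    r_k = knn[:, -1:]
    # MLE of LID: -( (1/k) * sum_i log(r_i / r_k) )^(-1).
    # The clip guards against log(0) when duplicate points yield r_i = 0.
    ratios = np.clip(knn / r_k, 1e-12, None)
    return -1.0 / np.mean(np.log(ratios), axis=1)

In the detection setting described in that literature, such LID estimates are typically computed from the representations at each layer of the classifier and used as features for a simple discriminator (e.g. logistic regression) trained to separate clean from adversarial inputs; how this paper adapts the estimator to textual inputs is detailed in its method section.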