Deep learning-based systems are susceptible to adversarial attacks, where a small, imperceptible change to the input alters the model's prediction. However, to date the majority of approaches for detecting these attacks have been designed for image processing systems. Many popular image adversarial detection approaches identify adversarial examples from embedding feature spaces, whereas existing state-of-the-art detection approaches in the NLP domain focus solely on input text features, without consideration of model embedding spaces. This work examines what happens when these image-designed strategies are ported to Natural Language Processing (NLP) tasks: the detectors are found not to port over well. This is expected, as NLP systems have a very different form of input: discrete and sequential in nature, rather than the continuous, fixed-size inputs of images. As an equivalent model-focused NLP detection approach, this work proposes a simple sentence-embedding "residue" based detector to identify adversarial examples. On many tasks, it outperforms ported image-domain detectors and recent state-of-the-art NLP-specific detectors.
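As a rough illustration of what a sentence-embedding residue detector of this kind might look like (a minimal sketch, not the paper's exact formulation), the snippet below assumes fixed-size sentence embeddings are available from the victim model's encoder, fits PCA on clean-data embeddings, and treats the norm of an embedding's component outside the top principal subspace as the detection signal; the `n_components` value and the percentile threshold rule are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical sketch of residue-based detection from sentence embeddings.
# Embeddings are assumed to come from the victim model's encoder
# (e.g. the [CLS] vector of a fine-tuned BERT classifier).

def fit_residue_detector(clean_embeddings: np.ndarray, n_components: int = 10) -> PCA:
    """Fit PCA on clean-data sentence embeddings; the top components
    span the directions where natural data concentrates."""
    pca = PCA(n_components=n_components)
    pca.fit(clean_embeddings)
    return pca

def residue_score(pca: PCA, embedding: np.ndarray) -> float:
    """Norm of the embedding component left over ("residue") after
    projecting onto the top principal subspace; adversarial examples
    are assumed to leave a larger residue than natural inputs."""
    centered = embedding - pca.mean_
    projection = pca.components_.T @ (pca.components_ @ centered)
    return float(np.linalg.norm(centered - projection))

# A sentence is flagged as adversarial when its residue score exceeds a
# threshold calibrated on held-out clean data (e.g. a high percentile
# of clean residue scores, chosen for an acceptable false-positive rate).
```

In this sketch the detector needs no adversarial examples at training time: it is fit purely on clean embeddings, and detection reduces to a single thresholded score per input.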