Adversarial attacks in NLP challenge the way we look at language models. The goal of such attacks is to modify the input text so that it fools a classifier while preserving the original meaning of the text. Although most existing adversarial attacks claim to fulfill the constraint of semantics preservation, careful scrutiny shows otherwise. We show that the problem lies in the text encoders used to determine the similarity of adversarial examples, specifically in the way they are trained. Unsupervised training methods make these encoders more susceptible to problems with antonym recognition. To overcome this, we introduce a simple, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). The results show that our solution minimizes the variation in the meaning of the generated adversarial examples. It also significantly improves the overall quality of adversarial examples, as confirmed by human evaluators. Furthermore, it can be used as a component in any existing attack to speed up its execution while maintaining a similar attack success rate.
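To make the similarity constraint concrete, the following minimal sketch (an illustration only, not the SPE method described above) shows how an attack pipeline might gate candidate perturbations with a sentence encoder; the sentence-transformers model, function name, and threshold are assumptions introduced here for illustration.

```python
# Illustrative sketch: using a sentence encoder as a semantic-similarity
# constraint in a textual adversarial attack. The model, threshold, and
# helper name are assumptions for illustration, not the SPE encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a generic unsupervised encoder

def is_semantics_preserving(original: str, perturbed: str, threshold: float = 0.85) -> bool:
    """Accept a candidate adversarial example only if its embedding stays
    close (by cosine similarity) to the original sentence's embedding."""
    emb_orig, emb_pert = encoder.encode([original, perturbed], convert_to_tensor=True)
    return util.cos_sim(emb_orig, emb_pert).item() >= threshold

# An antonym swap can slip past such a check: the pair below may still clear
# the threshold even though the sentiment (and thus the label) is reversed.
print(is_semantics_preserving("The movie was great.", "The movie was terrible."))
```

The point of the sketch is that the constraint is only as reliable as the encoder behind it: if the embedding space places antonymous sentences close together, label-flipping perturbations are wrongly accepted as meaning-preserving.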