Recent studies on adversarial images have shown that they tend to leave the underlying low-dimensional data manifold, making it significantly more challenging for current models to classify them correctly. This so-called off-manifold conjecture has inspired a novel line of defenses against adversarial attacks on images. In this study, we find that a similar phenomenon occurs in the contextualized embedding space induced by pretrained language models, in which the embeddings of adversarial texts tend to diverge from the manifold of natural ones. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold before classification. This projection reduces the complexity of potential adversarial examples, ultimately enhancing the robustness of the protected model. Through extensive experiments, we show that our method consistently and significantly outperforms previous defenses under various attack settings without trading off clean accuracy. To the best of our knowledge, this is the first NLP defense that leverages the manifold structure against adversarial attacks. Our code is available at \url{https://github.com/dangne/tmd}.
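The sketch below is only a minimal, hypothetical illustration of the generic projection-before-classification idea described above; the generator, dimensions, optimization loop, and classifier are illustrative assumptions and do not reflect the implementation released at the URL above.

\begin{verbatim}
# Illustrative sketch (NOT the paper's implementation): project a possibly
# adversarial text embedding onto an approximated manifold, then classify it.
import torch
import torch.nn as nn

EMBED_DIM, LATENT_DIM = 768, 64  # assumed sizes, for illustration only

# Hypothetical generator approximating the manifold of natural-text embeddings.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, EMBED_DIM)
)
# Hypothetical downstream classifier operating on (projected) embeddings.
classifier = nn.Linear(EMBED_DIM, 2)

def project_onto_manifold(embedding, steps=100, lr=0.05):
    """Find the point in the generator's range closest to the input embedding
    by optimizing a latent code, and return that projected embedding."""
    z = torch.zeros(1, LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(generator(z) - embedding)
        loss.backward()
        opt.step()
    return generator(z).detach()

# Usage: classify the projected embedding instead of the raw one.
raw_embedding = torch.randn(1, EMBED_DIM)   # stand-in for a contextualized embedding
projected = project_onto_manifold(raw_embedding)
logits = classifier(projected)
\end{verbatim}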