In this paper, we present an approach to improve the robustness of BERT language models against word substitution-based adversarial attacks by leveraging adversarial perturbations for self-supervised contrastive learning. We introduce a word-level adversarial attack that generates hard positives on-the-fly, which serve as adversarial examples during contrastive learning. In contrast to previous works, our method improves model robustness without using any labeled data. Experimental results show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks, and that combining our method with adversarial training yields higher robustness than adversarial training alone. As our method improves the robustness of BERT using only unlabeled data, it opens up the possibility of using large text corpora to train language models that are robust against word substitution-based adversarial attacks.
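To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss in which each anchor embedding is paired with an adversarially perturbed view as its hard positive, while the other perturbed views in the batch act as negatives. The function names, the toy 2-D embeddings, and the temperature value are illustrative assumptions, not the paper's actual implementation (which operates on BERT sentence representations and a word-substitution attack).

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (illustrative, pure-Python).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (a sketch, not the paper's exact objective).

    anchors[i]   -- embedding of the clean input i
    positives[i] -- embedding of its adversarially perturbed view (the hard positive)
    For each anchor, every other positive in the batch serves as a negative.
    """
    losses = []
    n = len(anchors)
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        losses.append(-math.log(exps[i] / sum(exps)))
    return sum(losses) / n
```

A well-trained encoder should embed a perturbed view close to its clean anchor, so the loss is low when each anchor matches its own positive and higher when the pairing is broken.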