This paper improves the robustness of the pretrained language model BERT against word substitution-based adversarial attacks by leveraging self-supervised contrastive learning with adversarial perturbations. One advantage of our method over previous works is that it improves model robustness without requiring any labels. Additionally, we create an adversarial attack for word-level adversarial training on BERT. The attack is efficient, allowing adversarial training on adversarial examples generated on the fly during training. Experimental results on four datasets show that our method improves the robustness of BERT against four different word substitution-based adversarial attacks. Furthermore, to understand why our method improves model robustness against adversarial attacks, we study the vector representations of clean examples and their corresponding adversarial examples before and after applying our method. As our method improves model robustness using only unlabeled raw data, it opens up the possibility of using large text datasets to train robust language models.