Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that adversarially trained models face a trade-off between robustness and generalization. The goal of this work is to provide an in-depth comparison of different approaches to adversarial training in language models. Specifically, we study the effect of pre-training data augmentation, as well as training-time input-space perturbations versus embedding-space perturbations, on the robustness and generalization of BERT-like language models. Our findings suggest that better robustness can be achieved through pre-training data augmentation or by training with input-space perturbations, whereas training with embedding-space perturbations significantly improves generalization. A linguistic correlation analysis of the neurons of the learned models reveals that the improved generalization is due to `more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples for adversarial training of language models.