Recently it has been shown that state-of-the-art NLP models are vulnerable to adversarial attacks, where the predictions of a model can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed and adapted to the discrete nature of text adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing for language models have not been studied. In this paper, we study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks, in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models such as BERT against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
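For readers unfamiliar with the technique, the following is a minimal sketch of standard uniform label smoothing as commonly used when fine-tuning classifiers such as BERT. It illustrates the general regularizer only, not the specific smoothing strategies evaluated in the paper; the function name and the smoothing factor epsilon are illustrative choices.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against uniformly smoothed targets.

    logits:  (batch, num_classes) raw model outputs
    targets: (batch,) integer class labels
    epsilon: probability mass redistributed uniformly over all classes
    """
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed target distribution: eps/K on every class, plus (1 - eps) on the true label.
    smooth_targets = torch.full_like(log_probs, epsilon / num_classes)
    smooth_targets.scatter_(
        1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / num_classes
    )
    # Expected negative log-likelihood under the smoothed targets.
    return -(smooth_targets * log_probs).sum(dim=-1).mean()
```

In recent PyTorch versions the same uniform variant is available directly via `torch.nn.CrossEntropyLoss(label_smoothing=0.1)`; by discouraging the model from assigning near-certain probability to any single class, it tempers the over-confident predictions discussed above.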