Recent works have shown that powerful pre-trained language models (PLMs) can be fooled by small perturbations or intentional attacks. To address this issue, various data augmentation techniques have been proposed to improve the robustness of PLMs. However, it remains challenging to generate augmented examples that are both semantically relevant and sufficiently diverse. In this work, we present Virtual Data Augmentation (VDA), a general framework for robustly fine-tuning PLMs. Based on the original token embeddings, we construct a multinomial mixture for augmenting virtual data embeddings, where a masked language model guarantees semantic relevance and Gaussian noise provides augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects. Extensive experiments on six datasets show that our approach improves the robustness of PLMs and alleviates performance degradation under adversarial attacks. Our code and data are publicly available at \textcolor{blue}{\url{https://github.com/RUCAIBox/VDA}}.
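To make the construction concrete, below is a minimal sketch in PyTorch of the idea described above, assuming a Hugging Face masked language model; the function name \texttt{virtual\_embeddings} and the \texttt{noise\_std} parameter are illustrative assumptions rather than the authors' implementation (see the repository linked above for the official code).

\begin{verbatim}
# A minimal sketch of the VDA idea, not the authors' implementation.
# Assumptions: Hugging Face Transformers, BERT-base as the MLM, and a
# hypothetical noise_std parameter controlling augmentation diversity.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def virtual_embeddings(text: str, noise_std: float = 1.0) -> torch.Tensor:
    """Return virtual token embeddings: a noisy multinomial mixture over
    the embedding matrix, weighted by masked-language-model probabilities."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits               # (1, seq_len, vocab)
    # Gaussian noise perturbs the mixture weights for diversity, while the
    # MLM distribution keeps the mixture semantically relevant.
    noisy = logits + noise_std * torch.randn_like(logits)
    probs = torch.softmax(noisy, dim=-1)            # mixture weights
    emb_matrix = mlm.get_input_embeddings().weight  # (vocab, hidden)
    return probs @ emb_matrix                       # (1, seq_len, hidden)
\end{verbatim}

Under this sketch, the resulting virtual embeddings would be fed to the task model (e.g., via the \texttt{inputs\_embeds} argument in Transformers) alongside the original embeddings during fine-tuning.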