Adversarial training (AT) is one of the most reliable methods for defending against adversarial attacks in machine learning. Variants of this method have been used as regularization mechanisms to achieve SOTA results on NLP benchmarks, and they have been found useful for transfer learning and continual learning. We investigate the reasons for the effectiveness of AT by contrasting vanilla and adversarially fine-tuned BERT models. We identify partial preservation of BERT's syntactic abilities during fine-tuning as the key to the success of AT. We observe that adversarially fine-tuned models remain more faithful to BERT's language modeling behavior and are more sensitive to word order. As concrete examples of syntactic abilities, an adversarially fine-tuned model can have an advantage of up to 38% on anaphora agreement and up to 11% on dependency parsing. Our analysis demonstrates that vanilla fine-tuning oversimplifies the sentence representation by focusing heavily on a small subset of words. AT, however, moderates the effect of these influential words and encourages representational diversity. This allows for a more hierarchical representation of a sentence and leads to the mitigation of BERT's loss of syntactic abilities.
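The abstract does not spell out which AT variant is used. As a minimal, illustrative sketch (not necessarily the authors' exact method), adversarial fine-tuning for BERT is commonly implemented FGM-style: perturb the word-embedding matrix in the direction of the loss gradient, add the loss on the perturbed input as a regularizer, then restore the embeddings. The `FGM` class and `adversarial_step` helper below are hypothetical names, and the model is assumed to be a HuggingFace-style classifier whose forward pass returns a `.loss`.

```python
import torch

class FGM:
    """Fast Gradient Method: temporarily perturb the word-embedding
    matrix along the loss gradient, then restore it after the
    adversarial backward pass."""
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon
        self.emb_name = emb_name  # substring identifying the embedding parameter
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # move embeddings in the gradient (loss-increasing) direction
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if self.emb_name in name and name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

def adversarial_step(model, batch, optimizer, fgm):
    """One fine-tuning step with adversarial regularization.
    `batch` is assumed to contain input_ids, attention_mask, labels."""
    loss = model(**batch).loss       # clean forward/backward
    loss.backward()
    fgm.attack()                     # perturb embeddings
    adv_loss = model(**batch).loss   # adversarial forward pass
    adv_loss.backward()              # gradients accumulate with clean ones
    fgm.restore()                    # undo the perturbation
    optimizer.step()
    optimizer.zero_grad()
```

Under this sketch, vanilla fine-tuning corresponds to skipping the `attack`/`restore` pair; the adversarial term is what moderates the influence of individual high-impact words on the sentence representation.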