An ultimate language system aims for high generalization and robustness when adapting to diverse scenarios. Unfortunately, recent pre-trained language models (PrLMs), for all their promise, rarely go beyond stacking ever more parameters onto the already over-parameterized Transformer architecture to achieve higher performance. This paper therefore proposes the \textit{Adversarial Self-Attention} (ASA) mechanism, which adversarially reconstructs the Transformer attentions and lets the model be trained against contaminated model structures, coupled with a fast and simple implementation for better PrLM building. We conduct comprehensive evaluations across a wide range of tasks at both the pre-training and fine-tuning stages. For pre-training, ASA yields remarkable performance gains over regular training for longer training periods. For fine-tuning, ASA-empowered models consistently outperform naive models by a large margin in terms of both generalization and robustness.
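As a hedged illustration only (not the paper's exact formulation), "adversarially reconstructing the Transformer attentions" can be read as a standard min-max training objective in which an adversary perturbs the attention structure within a budget; the symbols $\delta$, $\Omega$, and $\mathcal{L}$ below are illustrative assumptions rather than the authors' notation:
\begin{equation*}
\min_{\theta}\;\max_{\delta \in \Omega}\;
\mathcal{L}\!\left(f_{\theta}\big(x;\, A(x)\odot \delta\big),\, y\right),
\end{equation*}
where $f_{\theta}$ denotes the PrLM, $A(x)$ its self-attention matrices, $\delta$ an adversarial structural perturbation (e.g., a mask) restricted to a feasible set $\Omega$, and $\mathcal{L}$ the training loss.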