GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (\method) that is the key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed \method consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previously pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments on machine translation show that \method gains up to 3 BLEU points on the WMT14 English-German language pair, surpassing the previous state-of-the-art pre-training-aided NMT by 1.4 BLEU points. On the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU point.
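As a minimal sketch of how the two fusion-related techniques could be realized, the snippet below shows one possible reading of the dynamic switching gate (a sigmoid gate mixing pre-trained LM states with NMT encoder states) and of asymptotic distillation (an auxiliary term pulling the NMT encoder toward the LM representation). The abstract only names these techniques; the class and function names, the MSE form of the distillation term, and the weighting are assumptions introduced here for illustration, not the paper's definitive formulation.

\begin{verbatim}
import torch
import torch.nn as nn

class DynamicSwitchingGate(nn.Module):
    """Hypothetical gate that fuses a pre-trained LM representation with the
    NMT encoder state, so pre-trained knowledge is mixed in rather than
    overwritten (one way to read the 'dynamic switching gate')."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj_lm = nn.Linear(d_model, d_model)
        self.proj_nmt = nn.Linear(d_model, d_model)

    def forward(self, h_lm: torch.Tensor, h_nmt: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per position and dimension, how much of the
        # pre-trained representation to keep.
        g = torch.sigmoid(self.proj_lm(h_lm) + self.proj_nmt(h_nmt))
        return g * h_lm + (1.0 - g) * h_nmt


def asymptotic_distillation_loss(h_nmt: torch.Tensor,
                                 h_lm: torch.Tensor,
                                 weight: float = 0.5) -> torch.Tensor:
    """Hypothetical auxiliary loss keeping NMT encoder states close to the
    frozen LM states; added to the usual translation loss (assumed MSE form)."""
    return weight * nn.functional.mse_loss(h_nmt, h_lm.detach())


if __name__ == "__main__":
    batch, seq_len, d_model = 2, 7, 512
    h_lm = torch.randn(batch, seq_len, d_model)    # e.g. BERT outputs
    h_nmt = torch.randn(batch, seq_len, d_model)   # NMT encoder outputs
    gate = DynamicSwitchingGate(d_model)
    fused = gate(h_lm, h_nmt)
    aux = asymptotic_distillation_loss(h_nmt, h_lm)
    print(fused.shape, aux.item())
\end{verbatim}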