Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models: it reduces the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that, as a side effect of this training, lexical choice errors on low-frequency words are propagated from the teacher model to the NAT model. To alleviate this problem, we propose to expose the raw data to NAT models in order to restore the useful information about low-frequency words, which is missed in the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choice of the NAT model with that embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that the approach improves performance by reducing lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets to 27.8 and 33.8 BLEU, respectively. The source code will be released.
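For concreteness, the extra term can be sketched as a token-level KL divergence between a lexical distribution estimated from the raw (non-distilled) data and the NAT model's output distribution. The notation below (the raw-data distribution P_raw, the NAT distribution Q_theta, and the weight lambda) is our own illustration of such a term, not necessarily the paper's exact formulation:

\mathcal{L}_{\mathrm{KL}}
= \sum_{t=1}^{T} D_{\mathrm{KL}}\!\big( P_{\mathrm{raw}}(\cdot \mid \mathbf{x}, t) \,\big\|\, Q_{\theta}(\cdot \mid \mathbf{x}, t) \big)
= \sum_{t=1}^{T} \sum_{y \in \mathcal{V}} P_{\mathrm{raw}}(y \mid \mathbf{x}, t) \log \frac{P_{\mathrm{raw}}(y \mid \mathbf{x}, t)}{Q_{\theta}(y \mid \mathbf{x}, t)},
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{KD}} + \lambda\, \mathcal{L}_{\mathrm{KL}},

where \mathcal{V} is the target vocabulary, \mathcal{L}_{\mathrm{KD}} is the standard NAT training loss on the distilled data, and \lambda balances the distilled-data signal against the raw-data lexical prior.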