Neural Machine Translation (NMT) models achieve state-of-the-art performance on many translation benchmarks. As an active research field in NMT, knowledge distillation is widely applied to enhance a model's performance by transferring the teacher model's knowledge on each training sample. However, previous work rarely discusses the different impacts of and connections among these samples, which serve as the medium for transferring teacher knowledge. In this paper, we design a novel protocol that can effectively analyze the different impacts of samples by comparing various partitions of the samples. Based on this protocol, we conduct extensive experiments and find that more teacher knowledge is not always better: knowledge distilled from specific samples may even hurt the overall performance of knowledge distillation. Finally, to address these issues, we propose two simple yet effective strategies, i.e., batch-level and global-level selection, to pick suitable samples for distillation. We evaluate our approaches on two large-scale machine translation tasks, WMT'14 English->German and WMT'19 Chinese->English. Experimental results show that our approaches yield improvements of up to +1.28 and +0.89 BLEU points over the Transformer baseline, respectively.
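To make the batch-level selection idea concrete, the following is a minimal sketch of how distillation could be restricted to a subset of samples chosen within each batch. The selection criterion (the student's per-sample cross-entropy), the `keep_ratio`, and the temperature are illustrative assumptions, not details given in the abstract.

```python
# Minimal sketch of batch-level sample selection for knowledge distillation.
# Assumptions: per-sample student cross-entropy is the selection criterion,
# and `keep_ratio` controls how many samples in the batch receive the KD loss.
import torch
import torch.nn.functional as F


def batch_level_kd_loss(student_logits, teacher_logits, targets,
                        keep_ratio=0.5, temperature=1.0):
    """Distill only on samples selected within the current batch.

    student_logits, teacher_logits: (batch, vocab) logits per sample.
    targets: (batch,) gold token indices for the ordinary CE term.
    """
    # Per-sample cross-entropy of the student against the gold target,
    # used here as a hypothetical difficulty criterion for selection.
    ce_per_sample = F.cross_entropy(student_logits, targets, reduction="none")

    # Keep the top-k hardest samples in this batch for distillation.
    k = max(1, int(keep_ratio * student_logits.size(0)))
    selected = torch.topk(ce_per_sample, k).indices

    # Standard KD term (KL divergence to the teacher) on the selected samples only.
    s_logp = F.log_softmax(student_logits[selected] / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits[selected] / temperature, dim=-1)
    kd = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy is still computed on every sample in the batch.
    return ce_per_sample.mean() + kd
```

A global-level variant would apply the same idea but compare each sample against statistics accumulated over many batches rather than only the current one.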