Although teacher forcing has become the main training paradigm for neural machine translation, it conditions predictions only on past information and hence lacks global planning for the future. To address this problem, we introduce a second decoder, called the seer decoder, into the encoder-decoder framework during training, which incorporates future information into target predictions. Meanwhile, we force the conventional decoder to simulate the behavior of the seer decoder via knowledge distillation. In this way, at test time the conventional decoder can perform like the seer decoder without the seer decoder's participation. Experimental results on the Chinese-English, English-German, and English-Romanian translation tasks show that our method significantly outperforms competitive baselines and achieves greater improvements on larger data sets. In addition, the experiments show that knowledge distillation is the best way to transfer knowledge from the seer decoder to the conventional decoder, compared with adversarial learning and L2 regularization.
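As a rough illustration of the distillation step described above, the sketch below computes a word-level KL-divergence loss that pushes a student (the conventional decoder) toward a teacher (the seer decoder). This is a minimal, hypothetical simplification: the function names, the use of per-position logits, and the temperature parameter are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), averaged over target positions.

    Hypothetical sketch: `teacher_logits` would come from the seer
    decoder (which sees future context during training) and
    `student_logits` from the conventional decoder; minimizing this
    loss makes the student mimic the teacher's output distributions.
    """
    p = softmax(teacher_logits, temperature)  # seer decoder distribution
    q = softmax(student_logits, temperature)  # conventional decoder distribution
    eps = 1e-12  # avoid log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))
```

The loss is zero when the two decoders produce identical distributions and positive otherwise, so gradient descent on the student's parameters would move it toward the seer decoder's behavior.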