While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder of an MT model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it benefits decoders only for low-resource languages (LRLs). We further identify the attention heads that are most important for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019).
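To make the comparison concrete, below is a minimal sketch (in PyTorch; the `Seq2Seq` class and all names are illustrative assumptions, not the paper's actual codebase) of the initialization scheme described above: a bilingual model whose encoder and embeddings are copied from a multilingually trained model, while the decoder keeps its random initialization.

```python
# Minimal sketch: initialize a bilingual MT model's encoder from a
# multilingually trained model. All class and variable names are hypothetical.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A toy Transformer encoder-decoder translation model."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.proj = nn.Linear(d_model, vocab_size)

# Stand-in for a model trained many-to-one (or many-to-many); in practice
# this would be loaded from a multilingual checkpoint.
multilingual = Seq2Seq()

bilingual = Seq2Seq()
# Copy only the encoder (and embedding) parameters; the decoder stays
# randomly initialized, so any gain over a purely bilingual baseline
# isolates the effect of the multilingually trained encoder.
encoder_params = {
    k: v for k, v in multilingual.state_dict().items()
    if k.startswith(("encoder.", "embed."))
}
missing, unexpected = bilingual.load_state_dict(encoder_params, strict=False)
```

Swapping the filter to `"decoder."` yields the complementary decoder-initialized variant, so the contribution of each component can be compared under the same bilingual fine-tuning data.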