Monolingual data have been demonstrated to be helpful in improving the translation quality of both statistical machine translation (SMT) and neural machine translation (NMT) systems, especially in resource-poor settings or domain adaptation tasks where parallel data are scarce. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data, one for each direction, and these two models are iteratively updated by incrementally decreasing the translation loss on the training data. In each iteration, both NMT models first translate monolingual data from one language to the other, forming pseudo-training data for the model in the other direction. Two new NMT models are then trained on the parallel data together with this pseudo-training data. Both models are thus expected to improve, so that better pseudo-training data can be generated in the next iteration. Experimental results on Chinese-English and English-German translation tasks show that our approach simultaneously improves the translation quality of the source-to-target and target-to-source models, significantly outperforming strong baseline systems that also exploit monolingual data for model training, including back-translation.
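To make the iterative procedure concrete, the following is a minimal Python sketch of the joint E-step/M-step training loop described above; `train_model` and `translate` are hypothetical stubs standing in for a full NMT toolkit, not the authors' implementation.

```python
# Hedged sketch of the joint training loop: each direction's model generates
# pseudo-parallel data for the other direction, then both are retrained.
# train_model and translate are placeholder stubs, assumed for illustration.

def train_model(bitext, pseudo_bitext=()):
    """Placeholder: fit an NMT model on real + pseudo sentence pairs."""
    return {"train_pairs": list(bitext) + list(pseudo_bitext)}  # stub "model"

def translate(model, sentences):
    """Placeholder: decode each sentence with the given model."""
    return [f"<hyp of {s}>" for s in sentences]  # stub translations

def joint_em_training(bitext_xy, mono_x, mono_y, iterations=3):
    """Jointly improve source-to-target (f) and target-to-source (g) models.

    E-step: each model translates monolingual text, yielding pseudo pairs
            for the model in the opposite direction.
    M-step: retrain both models on parallel plus fresh pseudo data.
    """
    # Initialization: pre-train both directions on parallel data only.
    f = train_model(bitext_xy)                           # x -> y
    g = train_model([(y, x) for x, y in bitext_xy])      # y -> x

    for _ in range(iterations):
        # E-step: build pseudo-parallel corpora from monolingual data.
        pseudo_for_g = [(y_hat, x)
                        for x, y_hat in zip(mono_x, translate(f, mono_x))]
        pseudo_for_f = [(x_hat, y)
                        for y, x_hat in zip(mono_y, translate(g, mono_y))]

        # M-step: retrain each model on real bitext + the other's output.
        f = train_model(bitext_xy, pseudo_for_f)
        g = train_model([(y, x) for x, y in bitext_xy], pseudo_for_g)

    return f, g
```

Note that in this sketch the pseudo pairs for one direction always keep the genuine monolingual sentence on the target side, so each model is trained toward real, fluent target text.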