Document-level neural machine translation (DocNMT) achieves coherent translations by incorporating cross-sentence context. However, for most language pairs parallel documents are scarce, although parallel sentences are readily available. In this paper, we study whether and how contextual modeling in DocNMT is transferable via multilingual modeling. We focus on the scenario of zero-shot transfer from teacher languages with document-level data to student languages with no documents but sentence-level data, and, for the first time, treat document-level translation as a transfer learning problem. Using simple concatenation-based DocNMT, we explore the effect of three factors on the transfer: the number of teacher languages with document-level data, the balance between document- and sentence-level data at training, and the data condition of the parallel documents (genuine vs. back-translated). Our experiments on Europarl-7 and IWSLT-10 show the feasibility of multilingual transfer for DocNMT, particularly on document-specific metrics. We observe that more teacher languages and an adequate data balance both contribute to better transfer quality. Surprisingly, the transfer is less sensitive to the data condition: multilingual DocNMT delivers decent performance with either back-translated or genuine document pairs.
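As a point of reference, concatenation-based DocNMT typically turns a parallel document into training examples by joining consecutive sentences with a boundary token on both the source and target sides. The sketch below illustrates this preprocessing step; the separator token `<sep>`, the function name, and the fixed window size are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumed preprocessing, not the authors' code):
# build concatenation-based DocNMT examples by joining consecutive
# source/target sentences of a parallel document with a separator.

from typing import List, Tuple

SEP = " <sep> "  # hypothetical sentence-boundary marker


def make_doc_examples(
    src_sents: List[str],
    tgt_sents: List[str],
    window: int = 3,  # assumed fixed context-window size
) -> List[Tuple[str, str]]:
    """Slide a fixed-size window over a parallel document and emit
    (source, target) pairs whose sentences are concatenated with SEP."""
    assert len(src_sents) == len(tgt_sents)
    examples = []
    for i in range(0, len(src_sents), window):
        src = SEP.join(src_sents[i:i + window])
        tgt = SEP.join(tgt_sents[i:i + window])
        examples.append((src, tgt))
    return examples


if __name__ == "__main__":
    # A 4-sentence document with window=3 yields two examples:
    # one with three concatenated sentences, one with the remainder.
    src = ["Er kam an.", "Es regnete.", "Er blieb.", "Dann ging er."]
    tgt = ["He arrived.", "It rained.", "He stayed.", "Then he left."]
    for s, t in make_doc_examples(src, tgt):
        print(s, "=>", t)
```

Under this framing, a student language with only sentence-level data simply contributes single-sentence examples (window of one), while teacher languages contribute concatenated document-level examples, so both can be mixed in one multilingual training corpus.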