The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translation to and from a low-resource language using only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation, and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.