Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for five low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs and match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under varying degrees of data quality, as well as to analyze the factors that led to the superior performance of the proposed approach over traditional unsupervised models.