We conduct an empirical study of neural machine translation (NMT) for truly low-resource languages, and propose a training curriculum suited to settings where both parallel training data and compute resources are scarce, reflecting the reality of most of the world's languages and of the researchers working on them. Previously, unsupervised NMT, which employs back-translation (BT) and auto-encoding (AE) tasks, has been shown to be ineffective for low-resource languages. We demonstrate that leveraging comparable data and code-switching as weak supervision, combined with BT and AE objectives, yields remarkable improvements for low-resource languages even with only modest compute resources. The training curriculum proposed in this work achieves BLEU scores that improve over supervised NMT trained on the same backbone architecture by +12.2 BLEU for English to Gujarati and +3.7 BLEU for English to Kazakh, showcasing the potential of weakly-supervised NMT for low-resource languages. When trained on supervised data, our training curriculum achieves a new state-of-the-art result on the Somali dataset (BLEU of 29.3 for Somali to English). We also observe that allocating more training time and GPUs further improves performance, which underscores the importance of reporting compute resource usage in MT research.