Low-resource African languages have not fully benefited from the progress in neural machine translation because of a lack of data. Motivated by this challenge, we compare zero-shot learning, transfer learning and multilingual learning on three Bantu languages (Shona, isiXhosa and isiZulu) and English. Our main target is English-to-isiZulu translation, for which we have just 30,000 sentence pairs, 28% of the average size of our other corpora. We show the importance of language similarity for English-to-isiZulu transfer learning, comparing English-to-isiXhosa and English-to-Shona parent models whose resulting BLEU scores differ by 5.2. We then demonstrate that multilingual learning surpasses both transfer learning and zero-shot learning on our dataset, with BLEU improvements of 9.9, 6.1 and 2.0 over the baseline English-to-isiZulu model, respectively. Our best model also improves on the previous state-of-the-art BLEU score by more than 10 points.
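To make the transfer-learning setup concrete, the sketch below shows parent-to-child weight initialisation followed by fine-tuning in PyTorch. This is a minimal illustration, not the paper's actual architecture or training pipeline: the TinySeq2Seq class, the joint vocabulary size, the learning rate and the dummy batches are all hypothetical placeholders.

```python
import torch
import torch.nn as nn


class TinySeq2Seq(nn.Module):
    """Toy Transformer encoder-decoder, only to illustrate weight transfer."""

    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # Causal mask so the decoder cannot attend to future target tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(self.embed(src), self.embed(tgt),
                                  tgt_mask=mask)
        return self.out(hidden)


vocab_size = 32000  # hypothetical joint subword vocabulary across languages

# Parent model: in the setup described above this would be trained to
# convergence on a higher-resource pair, e.g. English-to-isiXhosa.
parent = TinySeq2Seq(vocab_size)

# Transfer: the child model starts from the parent's weights. Sharing the
# English source side and a joint vocabulary is what makes this transfer
# meaningful; language similarity on the target side then drives quality.
child = TinySeq2Seq(vocab_size)
child.load_state_dict(parent.state_dict())

# Fine-tune on the low-resource English-to-isiZulu pairs (dummy batch here).
optimizer = torch.optim.Adam(child.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, vocab_size, (8, 20))   # dummy English batch
tgt = torch.randint(0, vocab_size, (8, 21))   # dummy isiZulu batch

logits = child(src, tgt[:, :-1])              # teacher forcing: shifted input
loss = loss_fn(logits.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

In this scheme the zero-shot and multilingual comparisons differ only in what the child sees: zero-shot evaluates the parent directly on English-to-isiZulu, while multilingual training mixes all language pairs into a single model's training data.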