Machine translation (MT) involving Indigenous languages, including possibly endangered ones, is challenging due to the lack of sufficient parallel data. We describe an approach that exploits bilingual and multilingual pretrained MT models in a transfer learning setting to translate from Spanish to ten South American Indigenous languages. Our models set a new SOTA on five of the ten language pairs we consider, even doubling performance on one of these five pairs. Unlike previous SOTA systems, which perform data augmentation to enlarge the training sets, we retain the low-resource setting to test the effectiveness of our models under this constraint. Despite the scarcity of linguistic information available about the Indigenous languages, we offer a number of quantitative and qualitative analyses (e.g., of morphology, tokenization, and orthography) to contextualize our results.
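The core technique summarized above, continuing the training of a pretrained multilingual MT model on a small Spanish-to-Indigenous-language parallel corpus, can be illustrated with a minimal sketch. The checkpoint name (facebook/nllb-200-distilled-600M), the NLLB language codes (spa_Latn for Spanish, quy_Latn for Ayacucho Quechua), the hyperparameters, and the toy two-sentence corpus are assumptions chosen for illustration, not the paper's actual configuration or data.

```python
# Minimal transfer-learning sketch: fine-tune a pretrained multilingual MT
# checkpoint on a tiny Spanish->Quechua parallel set (illustrative data only).
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumed checkpoint and NLLB-style language codes (not the paper's setup).
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, src_lang="spa_Latn", tgt_lang="quy_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel data standing in for the low-resource training set.
pairs = {
    "es": ["Buenos días.", "¿Cómo estás?"],
    "quy": ["Allin punchaw.", "¿Imaynallam kachkanki?"],
}

def preprocess(batch):
    # Tokenize the source side; labels come from the target side (text_target).
    return tokenizer(batch["es"], text_target=batch["quy"],
                     truncation=True, max_length=128)

train_ds = Dataset.from_dict(pairs).map(
    preprocess, batched=True, remove_columns=["es", "quy"])

args = Seq2SeqTrainingArguments(
    output_dir="es-quy-finetuned",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # Pads inputs and labels dynamically per batch.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In this sketch the pretrained model supplies the multilingual encoder-decoder weights, and only the small parallel corpus drives adaptation, which mirrors the low-resource constraint the abstract describes.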