We translate a closed text that is known in advance into a severely low-resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low-resource language using only ~1,000 lines of low-resource data and no external help. First, we propose a systematic method to rank and choose source languages that are close to the low-resource language. We call the linguistically defined language family the Family of Origin (FAMO), and the empirically chosen set of highest-ranked languages under our metrics the Family of Choice (FAMC). Second, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) and train it on ~1,000 lines (~3.5%) of low-resource data. To translate named entities correctly, we build a massive lexicon table for 2,939 Bible named entities in 124 source languages; it includes many entities that occur only once and covers more than 66 severely low-resource languages. Moreover, we devise a novel method of combining translations from different source languages into one. Using English as a hypothetical low-resource language, we obtain a +23.9 BLEU increase over a multilingual baseline and a +10.3 BLEU increase over our asymmetric baseline on the Bible dataset. We reach a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset. We also obtain good results on Eastern Pokomchi, a real severely low-resource Mayan language.
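The abstract does not name the ranking metrics, so the sketch below is a minimal illustration only: it assumes closeness is scored by character trigram overlap between each source language's text and the ~1,000-line low-resource seed. The names `overlap_score` and `rank_famc`, and the trigram metric itself, are illustrative stand-ins, not the paper's actual metrics.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Multiset of character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 0)))

def overlap_score(src_text: str, seed_text: str, n: int = 3) -> float:
    """Fraction of the source text's n-gram mass shared with the seed.
    Stand-in closeness metric; the paper's metrics are not given here."""
    src, seed = char_ngrams(src_text, n), char_ngrams(seed_text, n)
    shared = sum(min(count, seed[gram]) for gram, count in src.items())
    return shared / max(sum(src.values()), 1)

def rank_famc(source_texts: dict, seed_text: str, k: int = 10) -> list:
    """Rank all source languages by closeness to the low-resource seed
    text and keep the top-k as the Family of Choice (FAMC)."""
    ranked = sorted(source_texts,
                    key=lambda lang: overlap_score(source_texts[lang], seed_text),
                    reverse=True)
    return ranked[:k]
```

Ranking all 124 candidates this way and keeping the top-k gives an empirical Family of Choice, which need not coincide with the linguistic Family of Origin.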
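For the named-entity handling, one plausible reading of an "order-preserving lexiconized" model is that entities are swapped for indexed placeholder tokens before translation and restored from the lexicon table afterwards, so the model never has to spell out rare names. The placeholder format `__NE0__` and both helper names below are assumptions for illustration, not the paper's confirmed scheme.

```python
import re

def lexiconize(sentence: str, lexicon: dict) -> tuple:
    """Replace known named entities with indexed, order-preserving
    placeholders. `lexicon` maps a source-language entity to its
    target-language form from the lexicon table."""
    slots, tokens = [], []
    for tok in sentence.split():
        if tok in lexicon:
            tokens.append(f"__NE{len(slots)}__")
            slots.append(lexicon[tok])  # remember target form, in order
        else:
            tokens.append(tok)
    return " ".join(tokens), slots

def delexiconize(translation: str, slots: list) -> str:
    """Restore placeholders in the model output from the lexicon table."""
    return re.sub(r"__NE(\d+)__", lambda m: slots[int(m.group(1))], translation)

# Example: lexiconize("Moses spoke to Aaron", {"Moses": "Moises", "Aaron": "Aaron"})
# yields ("__NE0__ spoke to __NE1__", ["Moises", "Aaron"]); after translation,
# delexiconize fills the slots back in, in the original order.
```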
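The combination method itself is not described in the abstract. A simple stand-in is consensus (minimum-Bayes-risk-style) selection: for each sentence, keep the candidate translation that agrees most with the candidates produced from the other source languages. Both helpers below are hypothetical sketches under that assumption.

```python
def token_f1(a: str, b: str) -> float:
    """Unigram F1 between two whitespace-tokenized sentences."""
    ta, tb = a.split(), b.split()
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta))
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def combine_candidates(candidates: list) -> str:
    """Pick the candidate translation that is closest, on average,
    to all other candidates (a consensus/centroid choice)."""
    best, best_score = candidates[0], float("-inf")
    for i, cand in enumerate(candidates):
        score = sum(token_f1(cand, other)
                    for j, other in enumerate(candidates) if j != i)
        if score > best_score:
            best, best_score = cand, score
    return best
```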