Multilingual transfer techniques often improve low-resource machine translation (MT), but many of them are applied without considering data characteristics. We show, in the context of Haitian-to-English translation, that transfer effectiveness is correlated with the amount of training data and with the relationships between the knowledge-sharing languages. Our experiments suggest that, for some languages, back-translation augmentation methods become counterproductive beyond a threshold of authentic data, while cross-lingual transfer from a sufficiently related language is preferred. We complement this finding by contributing a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding. When used with multilingual techniques, orthographic transformation yields statistically significant improvements over conventional methods. In very low-resource Jamaican MT, code-switching with a transfer language for orthographic resemblance yields a 6.63 BLEU point advantage.