Efficiently and accurately translating a corpus into a low-resource language remains a challenge, regardless of the strategies employed, whether manual, automated, or a combination of the two. Many Christian organizations are dedicated to translating the Holy Bible into languages that lack a modern translation, and Bible translation (BT) work is currently underway for over 3,000 extremely low-resource languages. We introduce the eBible corpus: a dataset containing 1,009 translations of portions of the Bible, with data in 833 different languages across 75 language families. In addition to this BT benchmarking dataset, we introduce model performance benchmarks built on the No Language Left Behind (NLLB) neural machine translation (NMT) models. Finally, we describe several problems specific to the domain of BT and consider how the established data and model benchmarks might be used in future translation efforts. On a BT task trained with NLLB, the Austronesian and Trans-New Guinea language families achieve BLEU scores of 35.1 and 31.6 respectively, motivating future innovation in NMT for the low-resource languages of Papua New Guinea.