AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages

Reproducible benchmarks are crucial for driving progress in machine translation research. However, existing machine translation benchmarks have mostly been limited to high-resource or well-represented languages. Despite increasing interest in low-resource machine translation, there are no standardized, reproducible benchmarks for many African languages, many of which are used by millions of speakers but have little digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis that take into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource-focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.
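
The abstract mentions building pseudo-monolingual data from word-level alignment information but gives no procedure. As a rough illustration only, the sketch below assumes one plausible reading: words in a source-language sentence are swapped for entries from a bilingual lexicon (such as one extracted from word alignments). The function name `pseudo_monolingual`, the `replace_prob` parameter, and the toy English-to-Swahili lexicon are all hypothetical and are not taken from the paper.

```python
import random


def pseudo_monolingual(sentence, lexicon, replace_prob=1.0, seed=0):
    """Swap tokens for lexicon translations (hypothetical sketch, not the paper's code).

    sentence:     whitespace-tokenized source-language text
    lexicon:      dict mapping a lowercase source word to a list of target words
    replace_prob: chance of replacing a token that has a lexicon entry
    """
    rng = random.Random(seed)  # local RNG so results are repeatable
    out = []
    for token in sentence.split():
        candidates = lexicon.get(token.lower())
        if candidates and rng.random() < replace_prob:
            # Pick one translation; a real lexicon may list several per word.
            out.append(rng.choice(candidates))
        else:
            # Tokens without a lexicon entry are kept unchanged.
            out.append(token)
    return " ".join(out)


# Toy lexicon (assumption): two English words with Swahili translations.
toy_lexicon = {"water": ["maji"], "book": ["kitabu"]}
example = pseudo_monolingual("the water and the book", toy_lexicon)
print(example)
```

Text produced this way mixes both languages, so it would at best be a noisy stand-in for genuine monolingual data; how the authors actually construct and use such data should be checked against the released code.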

