MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

Massively multilingual machine translation (MT) has shown impressive capabilities, including zero-shot and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá-English (yo-en) language pair with standardized train-test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare them to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. When we use MENYO-20k to fine-tune generic models, we achieve major gains of +9.9 and +8.6 BLEU (en2yo) over Facebook's M2M-100 and Google multilingual NMT, respectively.

