English-Twi Parallel Corpus for Machine Translation

Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Nana Boateng, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah, James Ben Hayfron-Acquah

From arXiv; 9 pages. Accepted at the African NLP workshop @ EACL 2021.

We present a parallel machine translation training corpus for English and Akuapem Twi of 25,421 sentence pairs. We used a transformer-based translator to generate initial translations in Akuapem Twi, which were later verified and corrected where necessary by native speakers to eliminate any occurrence of translationese. In addition, 697 higher-quality crowd-sourced sentences are provided for use as an evaluation set for downstream Natural Language Processing (NLP) tasks. The typical use case for the larger human-verified dataset is further training of machine translation models in Akuapem Twi. The higher-quality set of 697 crowd-sourced sentences is recommended as a test set for English-to-Twi and Twi-to-English machine translation models. Furthermore, the Twi part of the crowd-sourced data may also be used for other tasks, such as representation learning and classification. We fine-tune the transformer translation model on the training corpus and report benchmarks on the crowd-sourced test set.

