We present a parallel machine translation training corpus of 25,421 sentence pairs for English and Akuapem Twi. We used a transformer-based translator to generate initial translations into Akuapem Twi, which were then verified and corrected where necessary by native speakers to eliminate any occurrence of translationese. In addition, 697 higher-quality crowd-sourced sentence pairs are provided for use as an evaluation set for downstream Natural Language Processing (NLP) tasks. The typical use case for the larger human-verified dataset is further training of machine translation models for Akuapem Twi. The higher-quality crowd-sourced dataset of 697 pairs is recommended as a test set for English-to-Twi and Twi-to-English machine translation models. Furthermore, the Twi side of the crowd-sourced data may also be used for other tasks, such as representation learning and classification. We fine-tune the transformer translation model on the training corpus and report benchmarks on the crowd-sourced test set.
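To make the intended use concrete, the following is a minimal fine-tuning sketch with the Hugging Face transformers library, assuming the corpus is stored as a tab-separated file with "en" and "tw" columns and that a pretrained OPUS-MT-style checkpoint is used as the starting point. The file name `en_tw_train.tsv` and the checkpoint name `Helsinki-NLP/opus-mt-en-tw` are illustrative placeholders, not details confirmed by the paper.

```python
# Minimal fine-tuning sketch for an English-to-Twi translation model.
# Assumptions (not from the paper): the corpus lives in "en_tw_train.tsv"
# with columns "en" and "tw", and "Helsinki-NLP/opus-mt-en-tw" is a valid
# pretrained checkpoint; substitute your own paths and model name.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = "Helsinki-NLP/opus-mt-en-tw"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Load the parallel corpus from a TSV of English/Twi sentence pairs.
raw = load_dataset("csv", data_files={"train": "en_tw_train.tsv"},
                   delimiter="\t")

def preprocess(batch):
    # Tokenize the English source side and the Twi target side of each pair.
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["tw"], truncation=True,
                       max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw["train"].map(preprocess, batched=True,
                             remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-tw-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

For Twi-to-English, the same recipe applies with the source and target columns swapped and a reverse-direction checkpoint; the crowd-sourced set of 697 pairs would be held out entirely for evaluation rather than included in training.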