In this paper, we describe MorisienMT, a dataset for benchmarking machine translation quality for Mauritian Creole. Mauritian Creole (Morisien), a French-based creole language, is the lingua franca of the Republic of Mauritius. MorisienMT consists of English--Morisien and French--Morisien parallel corpora and a monolingual Morisien corpus. We first give an overview of Morisien and then describe the steps taken to create the corpora and, from them, the training and evaluation splits. Thereafter, we establish a variety of baseline models using the created parallel corpora as well as large French--English corpora for transfer learning. We release our datasets publicly for research purposes and hope that this spurs research on Morisien machine translation.