One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are of low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.