With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: at least 15 corpora contain no usable text, and a significant fraction contain fewer than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
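To illustrate the kind of nonstandard or ambiguous language codes the audit refers to, the sketch below flags a few well-known problem cases. The specific mappings are a small hand-curated sample for illustration only (not drawn from the audited datasets, and far from an exhaustive registry):

```python
# Illustrative check for deprecated or ambiguous language codes.
# DEPRECATED maps a few legacy ISO 639-1 codes to their modern forms;
# AMBIGUOUS lists codes that need a script/region subtag to be unambiguous.
# Both tables are small hand-picked samples, assumed for this sketch.
DEPRECATED = {"iw": "he", "in": "id", "ji": "yi"}
AMBIGUOUS = {"zh", "no", "sr"}

def check_code(code: str) -> str:
    """Return a short diagnosis for a BCP 47-style language tag."""
    base = code.split("-")[0].lower()
    if base in DEPRECATED:
        return f"deprecated: use '{DEPRECATED[base]}' instead of '{base}'"
    if base in AMBIGUOUS and "-" not in code:
        return f"ambiguous: '{code}' needs a script or region subtag"
    return "ok"

print(check_code("iw"))     # legacy code for Hebrew
print(check_code("zh"))     # macrolanguage, ambiguous without a subtag
print(check_code("en-US"))  # well-formed and unambiguous
```

A real audit pipeline would instead validate tags against the full IANA language subtag registry; this sketch only shows why a simple string match on dataset-provided codes is not enough.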