With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains fewer than 50% acceptable-quality sentences. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.