With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains fewer than 50% acceptable-quality sentences. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.