While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading, as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including the sources of input text and labels and the tools used to build them, and what they study: the tasks they address and the motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey the availability of NLP researchers and crowd workers proficient in each language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude with macro- and micro-level suggestions to the NLP community and to individual researchers for future multilingual data development.