Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent for low-resource dialectal languages. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.


