The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent for low-resource dialectal languages. In this paper, we create \textit{Masader}, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also highlight some issues with the current status of Arabic NLP datasets and suggest recommendations to address them.