While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading, as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including the sources of input text and labels and the tools used to build them, and what they study: the tasks they address and the motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey the availability of NLP researchers and crowd workers proficient in each language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude with macro- and micro-level suggestions to the NLP community and to individual researchers for future multilingual data development.