As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. To do so, we use entity recognition and linking systems, and in the process we make important observations about their cross-lingual consistency and give suggestions for more robust evaluation. Finally, we explore some geographical and economic factors that may explain the observed dataset distributions. Code and data are available here: https://github.com/ffaisal93/dataset_geography. Additional visualizations are available here: https://nlp.cs.gmu.edu/project/datasetmaps/.