With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual, hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia class sets to UNER labels. This is followed by a post-processing step that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
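The three-step procedure can be sketched as follows. This is a minimal, hypothetical illustration only: the `DBPEDIA_TO_UNER` mapping and the flat label names below are placeholders, not the paper's actual mapping table or the full hierarchical UNER label set.

```python
# Illustrative sketch of the three-step pipeline: entities are assumed to
# have already been extracted from Wikipedia and linked to DBpedia, so each
# entity arrives as a (surface form, set of DBpedia classes) pair.

# Placeholder mapping from DBpedia ontology classes to (simplified) UNER labels.
DBPEDIA_TO_UNER = {
    "dbo:Person": "PER",
    "dbo:Organisation": "ORG",
    "dbo:Place": "LOC",
}

def map_classes_to_uner(dbpedia_classes):
    """Map the set of DBpedia classes linked to one entity to a UNER label."""
    for cls in dbpedia_classes:
        if cls in DBPEDIA_TO_UNER:
            return DBPEDIA_TO_UNER[cls]
    return "O"  # no mappable class: entity stays unlabelled

def annotate(linked_entities):
    """linked_entities: list of (surface_form, dbpedia_class_set) pairs,
    as produced by the extraction and linking steps."""
    return [(surface, map_classes_to_uner(classes))
            for surface, classes in linked_entities]
```

For example, an entity linked to `dbo:Person` would receive the `PER` label, while an entity whose classes are absent from the mapping falls through to `O` and becomes a candidate for the post-processing step.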