The goal of the named entity recognition (NER) task is to classify the proper nouns of a text into classes such as person, location, and organization. It is an important preprocessing step in many NLP tasks, such as question answering and summarization. Although NER has been studied extensively for English, where state-of-the-art systems reach F1 scores above 90 percent, there are very few studies of this task for Persian. One of the main reasons may be the lack of a standard Persian NER dataset on which to train and test NER systems. In this research we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed free of charge for research purposes. To construct such a dataset, we studied the standard NER datasets built for English and found that almost all of them are constructed from news texts. We therefore collected documents from ten news websites. Then, to give annotators guidelines for tagging these documents, we studied the guidelines used to construct the standard English CoNLL and MUC datasets and defined our own guidelines, taking Persian linguistic rules into account.
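As a rough illustration of the F1 measure mentioned above, the following sketch scores predicted entities against gold annotations at the entity level, counting a prediction as correct only when both its boundaries and its class match. It assumes BIO-style tags (`B-PER`, `I-PER`, `O`), which the text does not specify; the function names and the sample tag sequences are hypothetical and are not taken from the dataset.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity class)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tag scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            # A stray "I-X" with no open span is treated leniently as a start.
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: harmonic mean of precision and recall on exact spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-token sentence: the second entity gets the wrong class.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG"]
score = f1_score(gold, pred)
```

In this made-up example two of the three predicted entities match the gold annotation exactly, so precision and recall are both 2/3 and the F1 score is 2/3 as well.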