Named Entity Recognition (NER) is a foundational NLP task that assigns class labels such as Person, Location, Organisation, Time, and Number to words in free text. Named entities can also be multi-word expressions, in which case I-O-B (inside, outside, beginning) annotation marks where each entity starts and continues. While English and other European languages have considerable annotated data for NER, Indian languages lag behind, both in the quantity of data and in adherence to annotation standards. This paper releases a large, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in detail and provide an in-depth analysis of the NER tag-set used with our data. The tag statistics show a healthy per-tag distribution, especially for prominent classes such as Person, Location and Organisation. Since the proof of a resource's effectiveness lies in building models with it and testing them on benchmark data and against leaderboard entries in shared tasks, we do the same with our data. We use different language models to perform sequence labelling for NER and show the efficacy of our data through a comparative evaluation against models trained on another dataset available for Hindi NER. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity) for Hindi NER. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models at https://github.com/cfiltnlp/HiNER
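To make the I-O-B scheme mentioned above concrete, the following is a minimal illustrative sketch of how B-/I-/O tags group tokens into multi-word entities. The function name `iob_to_spans`, the English example sentence, and the tag names `PERSON` and `LOCATION` are assumptions chosen for illustration; they are not taken from the released dataset or code, whose exact tag inventory is described in the paper.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration only; not part of the HiNER code.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity (if any) and skip this token.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Assumed example sentence and tag names, for illustration only.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]
print(iob_to_spans(tokens, tags))
# [('Sachin Tendulkar', 'PERSON'), ('Mumbai', 'LOCATION')]
```

In this sketch an I- tag that does not follow a matching B- or I- tag is treated like O and dropped; other conventions promote such a tag to the start of a new entity.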