Named Entity Recognition (NER) is a well-researched NLP task and is widely used in real-world NLP scenarios. NER research typically focuses on creating new ways of training NER models, with relatively less emphasis on resources and evaluation. Further, state-of-the-art (SOTA) NER models, trained on standard datasets, typically report only a single performance measure (F-score), so we do not really know how well they perform across different entity types and genres of text, or how robust they are to new, unseen entities. In this paper, we perform a broad evaluation of NER using a popular dataset, taking into account the various text genres and sources that constitute it. Additionally, we generate six new adversarial test sets through small perturbations of the original test set, replacing select entities while retaining the context. We also train and test our models on randomly generated train/dev/test splits, followed by an experiment in which the models are trained on a select set of genres but tested on genres not seen in training. These comprehensive evaluation strategies were carried out using three SOTA NER models. Based on our results, we recommend some useful reporting practices for NER researchers that could help provide a better understanding of a SOTA model's performance in the future.
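To make the adversarial perturbation concrete, the sketch below shows one plausible way to replace entity mentions while retaining context, assuming the data is annotated in BIO format. This is an illustration, not the paper's released code: the `REPLACEMENTS` pool and the `perturb` function are hypothetical names, and the replacement source (here a toy dictionary) is an assumption.

```python
import random

# Hypothetical pool of replacement mentions per entity type (assumption:
# replacements for unseen entities could be drawn from external name lists).
REPLACEMENTS = {
    "PER": [["Maria", "Novak"], ["Kenji", "Sato"]],
    "ORG": [["Acme", "Corp"]],
    "LOC": [["Porto"]],
}

def perturb(tokens, tags, rng=random):
    """Swap each entity span for a same-type mention, re-emitting BIO tags."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{etype}":
                j += 1  # consume the full entity span
            # Fall back to the original span if no replacement pool exists.
            new_mention = rng.choice(REPLACEMENTS.get(etype, [tokens[i:j]]))
            out_tokens.extend(new_mention)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_mention) - 1))
            i = j
        else:
            # Context tokens are copied through unchanged.
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

tokens = ["John", "Smith", "works", "at", "Google", "in", "London", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]
print(perturb(tokens, tags))
```

Because only the entity spans change, any drop in F-score on such a perturbed test set can be attributed to the model's reliance on memorized entity surface forms rather than contextual cues.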