Named Entity Recognition (NER) plays an important role in a wide range of natural language processing tasks, such as relation extraction and question answering. However, previous studies on NER have been limited to a particular genre, relying on either small manually annotated datasets or large but low-quality ones. In this work, we propose a semi-supervised annotation framework that makes full use of Wikipedia abstracts to obtain a large, high-quality dataset called AnchorNER. We assume that anchored strings in abstracts are named entities and annotate them with the entity types recorded in DBpedia. To improve coverage, we design a neural correction model, trained on the human-annotated NER dataset DocRED, to correct false-negative entity labels, and we then train a BERT model on the corrected dataset. We evaluate our trained model on six NER datasets, and the experimental results show that we obtain state-of-the-art open-domain performance: on top of the strong BERT-base and BERT-large baselines, we achieve relative improvements of 4.66% and 3.07%, respectively.
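As a rough illustration of the anchor-based annotation step described above, the sketch below converts hyperlink (anchor) spans in a tokenized Wikipedia abstract into BIO tags using a DBpedia-style type lookup. The function name, input format, and type labels are hypothetical illustrations, not the paper's actual pipeline.

```python
# Minimal sketch of anchor-based distant supervision, under the assumption
# that each abstract arrives pre-tokenized with anchor spans already resolved
# to DBpedia entity types. `bio_tags_from_anchors` is an illustrative name.

from typing import List, Tuple


def bio_tags_from_anchors(
    tokens: List[str],
    anchors: List[Tuple[int, int, str]],  # (start, end_exclusive, entity_type)
) -> List[str]:
    """Label anchored spans with BIO tags; all other tokens get 'O'.

    False negatives remain: an entity mentioned without a hyperlink
    stays 'O', which is what the neural correction model later fixes.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in anchors:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


# Example: "Barack Obama was born in Hawaii ."
tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
# Suppose DBpedia maps the anchor "Barack Obama" -> PER and "Hawaii" -> LOC.
anchors = [(0, 2, "PER"), (5, 6, "LOC")]
print(bio_tags_from_anchors(tokens, anchors))
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O']
```

Under this scheme, any mention that Wikipedia editors did not hyperlink is silently labeled 'O', which is exactly the false-negative problem the correction model trained on DocRED is designed to address.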