We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be obtained automatically by matching entity mentions in raw text against entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme, comprising a new loss function and a noisy-label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.