Named entity recognition (NER) aims to identify mentions of named entities in unstructured text and classify them into predefined named entity classes. Although deep learning-based pre-trained language models achieve good predictive performance in NER, many domain-specific NER applications still require a substantial amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used in NER tasks to minimize annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens makes it challenging to design effective AL querying methods for NER. We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens (tokens likely to belong to a named entity), and we evaluate these functions with both sentence-based and token-based cost evaluation strategies. We also propose a data-driven normalization approach that penalizes sentences that are too long or too short. Our experiments on three datasets from different domains show that the proposed approach reduces the number of annotated tokens while achieving prediction performance better than or comparable to that of conventional methods.
Title: Focusing on Potential Named Entities During Active Label Acquisition
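The abstract does not spell out the proposed query evaluation functions or the normalization, so the following is only a minimal sketch of the general idea, under stated assumptions rather than the authors' method. It assumes that a "potential positive token" is one whose predicted probability of being a non-"O" class reaches a hypothetical `threshold`, that token uncertainty is measured by entropy, and that the length penalty is a Gaussian-shaped weight around the corpus mean sentence length. All function names, the `OUTSIDE` index, and the default values are illustrative.

```python
import math
from typing import List, Sequence

# Assumption: index 0 of each token's class distribution is the "O" (non-entity) class.
OUTSIDE = 0


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def positive_focused_score(token_probs: Sequence[Sequence[float]],
                           threshold: float = 0.3) -> float:
    """Sum token entropies, counting only potential positive tokens.

    A token counts as potentially positive when its predicted probability
    of being any entity class, 1 - P(O), is at least `threshold`
    (a hypothetical criterion chosen for illustration).
    """
    score = 0.0
    for probs in token_probs:
        p_entity = 1.0 - probs[OUTSIDE]
        if p_entity >= threshold:
            score += token_entropy(probs)
    return score


def length_weight(length: int, mean_len: float, std_len: float) -> float:
    """Hypothetical data-driven penalty: equals 1 at the corpus mean length
    and decays for sentences that are much shorter or much longer."""
    z = (length - mean_len) / std_len
    return math.exp(-0.5 * z * z)


def rank_sentences(batch: Sequence[Sequence[Sequence[float]]],
                   mean_len: float, std_len: float,
                   threshold: float = 0.3) -> List[int]:
    """Return sentence indices ordered from most to least worth annotating."""
    scores = [
        length_weight(len(sent), mean_len, std_len)
        * positive_focused_score(sent, threshold)
        for sent in batch
    ]
    return sorted(range(len(batch)), key=lambda i: scores[i], reverse=True)
```

In this sketch, a sentence whose tokens are all confidently predicted as "O" receives a score of zero and is queried last, which reflects the motivation stated in the abstract: under heavy class imbalance, uncertainty aggregated over all tokens is dominated by the majority class.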