Named entity recognition (NER) aims to identify mentions of named entities in unstructured text and classify them into predefined named entity classes. Even though deep learning-based pre-trained language models achieve good predictive performance, many domain-specific NER tasks still require a sufficient amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used for NER tasks to minimize the annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens introduces challenges in designing effective AL querying methods for NER. We propose AL sentence query evaluation functions that pay more attention to possible positive tokens, and we evaluate these proposed functions with both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach to penalize sentences that are too long or too short. Our experiments on three datasets from different domains reveal that the proposed approaches reduce the number of annotated tokens while achieving prediction performance that is better than or comparable to conventional methods.
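To make the querying idea concrete, the following is a minimal sketch, not the paper's exact scoring function, of a sentence-level query score that up-weights likely positive (entity) tokens and applies a data-driven length normalization. The names and parameters (positive_weight, mean_len, std_len) are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def sentence_query_score(token_probs, positive_weight=2.0,
                         mean_len=20.0, std_len=8.0):
    """Hypothetical sentence-level AL query score for NER.

    token_probs: array of shape (num_tokens, num_classes), where class 0
    is assumed to be the 'O' (non-entity) tag.
    """
    token_probs = np.asarray(token_probs)
    # Token-level uncertainty via least confidence: 1 - max class probability.
    uncertainty = 1.0 - token_probs.max(axis=1)
    # Up-weight tokens whose most likely label is an entity (positive) class,
    # so the score focuses on probable entity mentions rather than 'O' tokens.
    weights = np.where(token_probs.argmax(axis=1) != 0, positive_weight, 1.0)
    raw_score = float((weights * uncertainty).sum())
    # Data-driven length normalization: penalize sentences whose length is far
    # from the corpus mean (mean_len, std_len estimated from unlabeled data).
    length_penalty = 1.0 + abs(len(token_probs) - mean_len) / std_len
    return raw_score / length_penalty
```

In an AL loop, such a score would be computed for every unlabeled sentence and the highest-scoring sentences sent for annotation, with annotation cost measured either per sentence or per token.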