Current work in named entity recognition (NER) uses either cross-entropy (CE) or conditional random fields (CRF) as the objective/loss function to optimize the underlying NER model. Both of these traditional objective functions generally produce adequate performance when the data distribution is balanced and sufficient annotated training examples are available. However, since NER is inherently an imbalanced tagging problem, model performance under low-resource settings can suffer with these standard objective functions. Motivated by recent advances in maximizing the area under the ROC curve (AUC), we propose to optimize the NER model by maximizing the AUC score. We show that simply combining two binary classifiers that maximize the AUC score yields significant performance improvements over traditional loss functions under low-resource NER settings. We also conduct extensive experiments to demonstrate the advantages of our method under low-resource and highly imbalanced data distributions. To the best of our knowledge, this is the first work that brings AUC maximization to the NER setting. Furthermore, we show that our method is agnostic to different types of NER embeddings, models, and domains. The code to replicate this work will be provided upon request.
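The core idea of replacing CE with AUC maximization can be illustrated with a pairwise surrogate: the AUC equals the fraction of (entity, non-entity) token pairs the model ranks correctly, and a differentiable surrogate penalizes mis-ranked pairs. The sketch below is a minimal, illustrative squared-hinge formulation for one binary classifier, not the paper's exact two-classifier objective; all function names and the margin parameter are assumptions for illustration.

```python
# Illustrative sketch of pairwise AUC maximization for a binary
# token-level classifier (e.g., entity vs. non-entity). This is NOT
# the paper's exact formulation; names and the margin are assumptions.

def auc_surrogate_loss(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge penalty averaged over all positive/negative
    token pairs: minimized when every entity token outscores every
    non-entity token by at least `margin`."""
    penalties = [
        max(0.0, margin - (p - n)) ** 2
        for p in pos_scores
        for n in neg_scores
    ]
    return sum(penalties) / len(penalties)

def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of correctly ranked pos/neg pairs."""
    wins = sum(p > n for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Because the loss is defined over pairs rather than individual tokens, the (typically rare) entity class contributes to every pair, which is what makes AUC-style objectives attractive for imbalanced tagging. Two such binary classifiers could then be combined to recover multi-class NER tags.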