As a fundamental natural language processing task and one of the core knowledge extraction techniques, named entity recognition (NER) is widely used to extract information from text for downstream tasks. Nested NER is a branch of NER in which named entities (NEs) are nested within each other. However, most previous studies on nested NER apply a linear structure to model nested NEs, which are in fact accommodated in a hierarchical structure. To address this mismatch, this work models all the nested NEs in a sentence as one holistic structure, and we propose a holistic structure parsing algorithm to recover all the NEs at once. Moreover, to our knowledge, no prior work has applied corpus-level information to NER. To make up for the loss of this information, we introduce Pointwise Mutual Information (PMI) and other frequency features derived from corpus-aware statistics, extending holistic modeling from the sentence level to the corpus level for further gains. Experiments show that our model yields promising results on widely used benchmarks, approaching or even achieving state-of-the-art performance. Further empirical studies show that our proposed corpus-aware features substantially improve NER domain adaptation, which demonstrates the surprising advantage of our proposed corpus-level holistic structure modeling.
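As a minimal sketch of the corpus-aware statistics mentioned above, the following Python snippet computes PMI between adjacent tokens over a tokenized corpus. The function name and the restriction to adjacent-token bigrams are illustrative assumptions; the paper's exact feature definition may differ.

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """Compute PMI for adjacent token pairs over a tokenized corpus.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), where p(x, y) is the
    bigram probability and p(x), p(y) are unigram probabilities.
    A high PMI suggests the two tokens co-occur more often than chance,
    a useful corpus-level signal for entity boundaries.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Toy corpus: "New York" co-occurs consistently, so its PMI is positive.
corpus = [["New", "York", "City"], ["New", "York", "is", "big"]]
scores = pmi_scores(corpus)
```

Frequency features such as raw bigram counts can be read off the same counters, so the corpus-level statistics come essentially for free once the corpus has been scanned.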