Many recent named entity recognition (NER) studies criticize flat NER for its non-overlapping assumption, and switch to investigating nested NER. However, existing nested NER models rely heavily on training data annotated with nested entities, and labeling such data is costly. This study proposes a new subtask, nested-from-flat NER, which corresponds to a realistic application scenario: given data annotated with flat entities only, one may still desire the trained model to be capable of recognizing nested entities. To address this task, we train span-based models and deliberately ignore the spans nested inside labeled entities, since these spans are possibly unlabeled entities. With nested entities removed from the training data, our model achieves 54.8%, 54.2% and 41.1% F1 scores on the subset of spans within entities on ACE 2004, ACE 2005 and GENIA, respectively. This suggests the effectiveness of our approach and the feasibility of the task. In addition, the model's performance on flat entities is entirely unaffected. We further manually annotate the nested entities in the test set of CoNLL 2003, creating a nested-from-flat NER benchmark. Analysis results show that the main challenges stem from the data and annotation inconsistencies between flat and nested entities.
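The core training idea above, excluding spans strictly nested inside labeled entities from the training signal, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; spans are assumed to be inclusive-start, exclusive-end `(start, end)` token index pairs, and all names are illustrative.

```python
def spans_to_ignore(candidate_spans, entity_spans):
    """Return the candidate spans strictly nested inside a labeled entity.

    These spans may be unlabeled nested entities, so a span-based model
    trained on flat annotations should exclude them from the loss rather
    than treat them as confirmed negatives.
    """
    ignored = set()
    for s, e in candidate_spans:
        for es, ee in entity_spans:
            # Strictly inside a labeled entity (not the entity span itself).
            if es <= s and e <= ee and (s, e) != (es, ee):
                ignored.add((s, e))
                break
    return ignored


# Example: one labeled entity covering tokens [0, 3).
entities = [(0, 3)]
candidates = [(0, 3), (0, 1), (1, 3), (3, 5)]
print(spans_to_ignore(candidates, entities))  # {(0, 1), (1, 3)}
```

At inference time no spans are ignored, so the model may predict entities inside other entities, which is what allows nested recognition to emerge from flat-only supervision.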