Named Entity Recognition (NER) is a core task in natural language processing, yet discontinuous entities remain particularly difficult to recognize. The main difficulty lies in text segmentation: traditional methods often split or entirely miss discontinuous entities that span sentence boundaries, which significantly reduces recognition accuracy. We therefore aim to address the segmentation and omission problems associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction because of their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to improve their ability to recognize discontinuous entities and to handle segmentation errors. Experiments confirm that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, which degrades performance, whereas our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% on discontinuous entities, confirming the effectiveness of our approach.
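To make the idea of image-style augmentation on a tagging grid concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified word-pair grid, loosely modeled on common grid-tagging schemes, in which cell (i, j) marks token j as the next word of the same entity after token i, and cell (tail, head) closes an entity. The label ids `NNW` and `THW`, the helper names `build_grid`, `crop`, and `pad`, and the example phrase are all hypothetical; scaling is omitted.

```python
# Hypothetical label ids (assumptions, not taken from the paper):
# NNW = "next neighboring word", THW = "tail-to-head word".
NNW, THW = 1, 2


def build_grid(n_tokens, entities):
    """Build an n x n word-pair grid from entities given as token-index lists."""
    grid = [[0] * n_tokens for _ in range(n_tokens)]
    for span in entities:
        # Link each token of the entity to the next one, even across gaps.
        for a, b in zip(span, span[1:]):
            grid[a][b] = NNW
        # Close the entity by linking its last token back to its first.
        grid[span[-1]][span[0]] = THW
    return grid


def crop(grid, start, end):
    """Keep only rows and columns in [start, end), like cropping an image."""
    return [row[start:end] for row in grid[start:end]]


def pad(grid, size, fill=0):
    """Pad a square grid on the right and bottom up to size x size."""
    n = len(grid)
    rows = [row + [fill] * (size - n) for row in grid]
    rows += [[fill] * size for _ in range(size - n)]
    return rows


# Illustrative phrase with two discontinuous mentions sharing a prefix:
# "aching in legs" and "aching in ... shoulders".
tokens = ["aching", "in", "legs", "and", "shoulders"]
entities = [[0, 1, 2], [0, 1, 4]]

full = build_grid(len(tokens), entities)
window = crop(full, 0, 3)   # window covering only the first three tokens
padded = pad(window, 4)     # restore a fixed grid size after cropping
```

In this toy case, the cropped window still contains every cell of "aching in legs" but none of the cells that link to "shoulders". This is the kind of loss that a segmentation boundary falling inside a discontinuous entity can cause.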