Entity Recognition (ER) in text is a fundamental task in Natural Language Processing, enabling downstream tasks such as Knowledge Extraction, Text Summarisation, or Keyphrase Extraction. An entity consists of a single word or a consecutive sequence of terms, constituting the basic building blocks for communication. Mainstream ER approaches are largely limited to flat structures, concentrating on the outermost entities while ignoring the inner ones. This paper introduces a partly-layered network architecture that deals with the complexity of overlapping and nested cases. The proposed architecture consists of two parts: (1) a shared Sequence Layer and (2) a stacked component with multiple Tagging Layers. Adopting such an architecture has the advantage of preventing overfitting to a specific entity length, thus maintaining performance for longer entities despite their lower frequency. To verify the proposed architecture's effectiveness, we train and evaluate it on two kinds of entity recognition: Concept Recognition (CR) and Named Entity Recognition (NER). Our approach achieves state-of-the-art NER performance and outperforms previous CR approaches. Given these promising results, we see the possibility of extending the architecture to other cases, such as the extraction of events or the detection of argumentative components.
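To make the stacked-Tagging-Layers idea concrete, the sketch below shows one common way nested entities can be encoded as one BIO tag sequence per layer, with layer 0 holding the innermost spans and deeper layers the enclosing ones. This is an illustrative encoding under our own assumptions (the function name, span format, and entity labels are hypothetical), not the paper's exact scheme.

```python
def stacked_bio_tags(tokens, layered_spans):
    """Encode nested entity spans as one BIO sequence per tagging layer.

    tokens: list of token strings.
    layered_spans: one list of (start, end, label) spans per layer,
        with `end` exclusive. Layer 0 holds the innermost entities,
        later layers the entities that enclose them.
    """
    all_layers = []
    for spans in layered_spans:
        # Each layer is a flat BIO sequence over the same tokens.
        tags = ["O"] * len(tokens)
        for start, end, label in spans:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
        all_layers.append(tags)
    return all_layers


# Hypothetical nested example: an ORG span nested inside a larger span.
tokens = ["the", "University", "of", "Barcelona", "campus"]
layers = stacked_bio_tags(
    tokens,
    [
        [(1, 4, "ORG")],  # inner entity: "University of Barcelona"
        [(0, 5, "FAC")],  # enclosing entity: the whole phrase
    ],
)
```

A shared Sequence Layer would produce one token representation consumed by all such Tagging Layers, which is what lets the architecture handle overlapping and nested cases that a single flat tagger cannot.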