Nested Named Entity Recognition (NNER), an important sub-area of Named Entity Recognition, has been a long-standing challenge for researchers. In NNER, one entity may be part of a longer entity, and this nesting may occur at multiple levels. Because of these nested structures, traditional sequence labeling methods cannot recognize all entities. While recent research focuses on designing better recognition methods for NNER in a variety of languages, Chinese NNER (CNNER) still lacks attention, and no freely accessible, CNNER-specialized benchmark exists. In this paper, we address CNNER by providing a Chinese dataset and a learning-based model. To facilitate research on this task, we release ChiNesE, a CNNER dataset with 20,000 sentences sampled from online passages in multiple domains, containing 117,284 entities falling into 10 categories, where 43.8 percent of those entities are nested. Based on ChiNesE, we propose Mulco, a novel method that recognizes named entities in nested structures through multiple scopes. Each scope uses a purpose-designed scope-based sequence labeling method, which recognizes a named entity by predicting its anchor and its length. Experimental results show that Mulco outperforms several baseline methods with different recognition schemes on ChiNesE. We also conduct extensive experiments on the ACE2005 Chinese corpus, where Mulco achieves the best performance among the compared baseline methods.
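As a rough illustration of the anchor-plus-length idea, the following Python sketch decodes per-token predictions into entity spans and merges the spans from several scopes. It is not the Mulco implementation: the abstract does not say where the anchor sits or how scopes are defined, so the sketch assumes the anchor is the first token of an entity and treats each scope as an independent labeling pass. The function names `decode_scope` and `decode_multi_scope`, the labels, and the example predictions are all hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def decode_scope(anchors: List[str], lengths: List[int]) -> List[Span]:
    """Turn one scope's per-token (anchor label, length) predictions into spans.

    Assumption: a non-"O" label at position i marks i as the first token
    of an entity, and lengths[i] is the entity length in tokens.
    """
    spans: List[Span] = []
    for i, (label, length) in enumerate(zip(anchors, lengths)):
        # Skip non-anchors and predictions that run past the sentence end.
        if label != "O" and length > 0 and i + length <= len(anchors):
            spans.append((i, i + length, label))
    return spans


def decode_multi_scope(scopes: List[Tuple[List[str], List[int]]]) -> List[Span]:
    """Union the spans of all scopes, so overlapping entities can coexist."""
    merged: Set[Span] = set()
    for anchors, lengths in scopes:
        merged.update(decode_scope(anchors, lengths))
    return sorted(merged)


# Hypothetical predictions for the six characters of "北京大学校长".
tokens = list("北京大学校长")
scope_a = (["GPE", "O", "O", "O", "O", "O"], [2, 0, 0, 0, 0, 0])
scope_b = (["ORG", "O", "O", "O", "O", "O"], [4, 0, 0, 0, 0, 0])

spans = decode_multi_scope([scope_a, scope_b])
for start, end, label in spans:
    print("".join(tokens[start:end]), label)
```

In this toy example both entities start at the same character, so a single labeling pass with one anchor label per token could not represent both; giving each its own scope is one way nested entities can be kept apart.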