Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, as the boundaries of Named Entities do not necessarily coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs, we therefore need to answer two fundamental questions: what are the basic units to be labeled, and how can these units be detected and classified in realistic settings, i.e., where no gold morphology is available? We empirically investigate these questions on a novel NER benchmark with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich-and-ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, in which morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.
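To make the mismatch between token boundaries and entity boundaries concrete, the following is a minimal sketch, not the benchmark's actual format or the paper's model. It assumes a transliterated, illustrative two-token phrase (roughly "to the White House"), an assumed BIO labeling with an `ORG` type, and a hypothetical projection rule in which a token inherits the type of the first labeled morpheme it contains. The names `token_level_labels` and `entity_morphemes` are introduced here for illustration only.

```python
from typing import List, Tuple

# A token is a list of (morpheme, BIO label) pairs.
# All forms and labels below are illustrative assumptions.
Token = List[Tuple[str, str]]


def token_level_labels(tokens: List[Token]) -> List[str]:
    """Collapse morpheme-level BIO labels into one label per token.

    Assumed rule: a token containing any entity morpheme takes that
    entity's type, so non-entity prefixes are absorbed into the span.
    """
    labels: List[str] = []
    inside = False  # True if the previous token was part of an entity
    for morphemes in tokens:
        entity_labels = [lab for _, lab in morphemes if lab != "O"]
        if not entity_labels:
            labels.append("O")
            inside = False
            continue
        entity_type = entity_labels[0].split("-", 1)[1]
        starts_new = entity_labels[0].startswith("B-") or not inside
        labels.append(("B-" if starts_new else "I-") + entity_type)
        inside = True
    return labels


def entity_morphemes(tokens: List[Token]) -> List[str]:
    """Return only the morphemes that carry an entity label."""
    return [m for tok in tokens for m, lab in tok if lab != "O"]


# Token 1 = preposition + article + noun; token 2 = article + adjective.
sentence: List[Token] = [
    [("l", "O"), ("h", "B-ORG"), ("bit", "I-ORG")],
    [("h", "I-ORG"), ("lbn", "I-ORG")],
]

print(token_level_labels(sentence))  # one label per token
print(entity_morphemes(sentence))    # the preposition "l" is excluded
```

Under these assumptions, the token-level view labels the whole first token as part of the entity, including the preposition, whereas the morpheme-level view recovers the exact entity span.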