Cross-domain named entity recognition (NER) models are able to cope with the scarcity of NER samples in target domains. However, most existing NER benchmarks lack domain-specialized entity types or do not focus on a specific domain, leading to less effective cross-domain evaluation. To address these obstacles, we introduce a cross-domain NER dataset (CrossNER), a fully-labeled collection of NER data spanning five diverse domains with specialized entity categories for each domain. Additionally, we provide a domain-related corpus, since using it to continue pre-training language models (domain-adaptive pre-training) is effective for domain adaptation. We then conduct comprehensive experiments to explore the effectiveness of leveraging different levels of the domain corpus and different pre-training strategies for domain-adaptive pre-training on the cross-domain task. Results show that focusing on the fraction of the corpus containing domain-specialized entities and utilizing a more challenging pre-training strategy during domain-adaptive pre-training are beneficial for NER domain adaptation, and that our proposed method consistently outperforms existing cross-domain NER baselines. Nevertheless, the experiments also illustrate the difficulty of this cross-domain NER task. We hope that our dataset and baselines will catalyze research in the NER domain adaptation area. The code and data are available at https://github.com/zliucr/CrossNER.
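To make the domain-adaptive pre-training step concrete, below is a minimal sketch of continuing masked-language-model pre-training on a domain-related corpus before fine-tuning for NER. It assumes the Hugging Face Transformers and Datasets libraries; the model name, file path, and hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: domain-adaptive pre-training via continued masked-LM training
# on a domain-related corpus, prior to NER fine-tuning.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "bert-base-cased"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Domain-related corpus: plain text, one sentence per line (hypothetical path).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard random token masking (15%). A more challenging strategy, such as
# masking domain-specialized entity spans, would need a custom data collator.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt-checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
# The resulting checkpoint is then fine-tuned on the target-domain NER data.
```

The same recipe applies at different corpus levels: swapping `domain_corpus.txt` for the full domain corpus, a filtered subset containing domain-specialized entities, or a smaller task-related selection reproduces the corpus-level comparison described above.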