We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models to our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improves significantly (average macro-F1 improvement of +30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe it can support further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/, and we hope that this resource will help advance research in various aspects of NER.