This paper investigates the problem of Named Entity Recognition (NER) for extremely low-resource languages with only a few hundred tagged data samples. NER is a fundamental task in Natural Language Processing (NLP). A critical driver of progress in NER systems has been the existence of large-scale language corpora, which enable NER systems to achieve outstanding performance in languages with abundant training data such as English and French. However, NER for low-resource languages remains relatively unexplored. In this paper, we introduce Mask Augmented Named Entity Recognition (MANER), a new methodology that leverages the distributional hypothesis of pre-trained masked language models (MLMs) for NER. The <mask> token in pre-trained MLMs encodes valuable semantic contextual information. MANER re-purposes the <mask> token for NER prediction: specifically, we prepend the <mask> token to every word in a sentence for which we would like to predict the named entity tag. During training, we jointly fine-tune the MLM and a new NER prediction head attached to each <mask> token. We demonstrate that MANER is well-suited to NER in low-resource languages: in experiments on 100 languages with as few as 100 training examples, it improves on state-of-the-art methods by up to 48% and by 12% on average in F1 score. We also perform detailed analyses and ablation studies to understand the scenarios to which MANER is best suited.
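To make the mechanism in the abstract concrete, the following is a minimal sketch of the MANER idea: prepend a <mask> token before every word and classify the encoder's hidden state at each <mask> position with a new NER head, fine-tuning both jointly. The choice of backbone (xlm-roberta-base), the linear head, and all helper names here are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Sketch of MANER as described in the abstract (assumptions noted above):
# a <mask> is inserted before each word, and a linear NER head is applied
# to the MLM hidden state at every <mask> position.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class MANERTagger(nn.Module):
    def __init__(self, model_name="xlm-roberta-base", num_tags=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # New NER prediction head attached to each <mask> hidden state.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask, mask_positions):
        # Encode the sentence in which a <mask> precedes every word.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Gather the hidden state at each <mask> position; one per word.
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
        mask_states = hidden[batch_idx, mask_positions]
        return self.head(mask_states)  # (batch, num_words, num_tags)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
words = ["Alan", "Turing", "visited", "Paris"]

# Interleave <mask> before each word, then tokenize as one sequence.
text = " ".join(f"{tokenizer.mask_token} {w}" for w in words)
enc = tokenizer(text, return_tensors="pt")
mask_positions = (enc.input_ids == tokenizer.mask_token_id).nonzero()[:, 1]

model = MANERTagger()
logits = model(enc.input_ids, enc.attention_mask, mask_positions.unsqueeze(0))
print(logits.shape)  # torch.Size([1, 4, 9]): one tag distribution per word
```

In training, the cross-entropy loss over these per-word logits would be backpropagated through both the head and the MLM, matching the joint fine-tuning the abstract describes; the intuition is that the <mask> representation already aggregates the semantic context of the following word, which is what makes it a useful anchor for entity prediction when labeled data is scarce.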