Named Entity Recognition (NER) is the task of identifying and classifying named entities in large-scale texts into predefined classes. NER in French and other relatively low-resource languages cannot always benefit from approaches proposed for languages like English due to a dearth of large, robust datasets. In this paper, we present our work that aims to mitigate the effects of this dearth of large, labeled datasets. We propose a Transformer-based NER approach for French, using adversarial adaptation to similar-domain or general corpora to improve feature extraction and enable better generalization. Our approach learns better features from large-scale unlabeled corpora drawn from the same domain or mixed domains, introducing more variation during training and reducing overfitting. Experimental results on three labeled datasets show that our adaptation framework outperforms the corresponding non-adaptive models across various combinations of Transformer models, source datasets, and target corpora. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance drop incurred when using Transformer models pre-trained on smaller corpora.
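Adversarial adaptation of the kind described above is commonly realized with a gradient-reversal layer (as in DANN-style training): a domain discriminator tries to tell labeled-source features from unlabeled-corpus features, while reversed gradients push the encoder toward domain-invariant representations. The following is a minimal PyTorch sketch of that mechanism, not the paper's exact implementation; the 768-dimensional hidden size and module names are illustrative assumptions.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the feature encoder.
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts whether a feature vector comes from the labeled source
    dataset or the unlabeled target corpus (hypothetical sizes)."""

    def __init__(self, hidden_dim=768, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),  # two domains: source vs. target corpus
        )

    def forward(self, features):
        # Reverse gradients so the encoder is trained to fool this classifier.
        reversed_feats = GradReverse.apply(features, self.lambd)
        return self.classifier(reversed_feats)


# Sketch of use: Transformer token features feed both the NER head (on
# labeled data) and this discriminator (on labeled + unlabeled data).
features = torch.randn(4, 768, requires_grad=True)
logits = DomainDiscriminator()(features)
```

In training, the discriminator's cross-entropy loss is added to the NER loss; the reversal ensures that minimizing the joint objective makes the encoder's features less domain-discriminable.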