Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level $F_1$ scores compared to an out-of-domain neural NER model.
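To make the pipeline concrete, below is a minimal sketch of the two key steps the abstract describes: labelling functions that independently annotate target-domain tokens, followed by an aggregation step that merges their outputs into one unified annotation. The sketch uses majority voting as a simplified stand-in for the paper's hidden Markov model (which additionally learns each function's accuracy and confusions); all names here are illustrative, not from the paper's codebase.

```python
# Minimal weak-supervision sketch: labelling functions + aggregation.
# Majority vote is used as a simplified stand-in for the paper's HMM,
# which models the varying accuracies and confusions of each function.
from collections import Counter

def lf_gazetteer(tokens):
    """Labelling function: tag tokens found in a small company gazetteer."""
    companies = {"Reuters", "Bloomberg"}  # illustrative gazetteer
    return ["B-ORG" if tok in companies else "O" for tok in tokens]

def lf_capitalised(tokens):
    """Labelling function: heuristically tag capitalised non-initial tokens."""
    return ["B-MISC" if i > 0 and tok[0].isupper() else "O"
            for i, tok in enumerate(tokens)]

def aggregate(tokens, labelling_functions):
    """Merge the outputs of all labelling functions by majority vote.

    The paper instead fits an HMM over the sequence of (hidden) true
    labels, treating each function's output as a noisy observation.
    """
    votes_per_token = zip(*(lf(tokens) for lf in labelling_functions))
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_token]

tokens = ["Reuters", "reported", "that", "Bloomberg", "hired", "analysts", "."]
unified = aggregate(tokens, [lf_gazetteer, lf_capitalised])
print(list(zip(tokens, unified)))
# The unified annotation can then serve as training data for a standard
# sequence labelling model on the target domain, as the abstract outlines.
```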