Being able to train Named Entity Recognition (NER) models for emerging topics is crucial for many real-world applications, especially in the medical domain, where new topics continuously evolve beyond the scope of existing models and datasets. For a realistic evaluation setup, we introduce a novel COVID-19 news NER dataset (COVIDNEWS-NER) and release 3,000 hand-annotated, strongly labelled sentences and 13,000 auto-generated, weakly labelled sentences. Alongside the dataset, we propose CONTROSTER, a recipe for strategically combining weak and strong labels to improve NER on an emerging topic through transfer learning. We show the effectiveness of CONTROSTER on COVIDNEWS-NER and provide an analysis of how weak and strong labels are best combined for training. Our key findings are: (1) Using weak data to build an initial backbone before fine-tuning on strong data outperforms methods trained on strong or weak data alone. (2) Combining out-of-domain and in-domain weak-label training is crucial and can overcome saturation when training on weak labels from a single source.
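The two-stage recipe in the findings above can be sketched as follows. This is a minimal illustrative outline only, assuming a generic `train` stage; the dataset names and helper function are placeholders, not the authors' actual implementation.

```python
# Illustrative sketch of the two-stage weak-then-strong training order.
# train() here only records which stage ran; in practice each call would
# fine-tune an NER model on the named dataset.

def train(model_state, dataset_name):
    """Simulate one training stage by recording it in the model's history."""
    return model_state + [dataset_name]

def controster_recipe():
    model = []  # untrained backbone
    # Stage 1: build an initial backbone from weak labels, mixing
    # out-of-domain and in-domain weak sources to avoid saturating
    # on a single weak-label source (finding 2).
    model = train(model, "weak_out_of_domain")
    model = train(model, "weak_in_domain")
    # Stage 2: fine-tune on the small strongly labelled set (finding 1).
    model = train(model, "strong_hand_annotated")
    return model

print(controster_recipe())
```

The key design point is the ordering: weak labels shape the backbone first, and the scarce strong labels are reserved for the final fine-tuning stage.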