Recent studies in deep learning have shown significant progress in named entity recognition (NER). Most existing works assume clean data annotation, yet a fundamental challenge in real-world scenarios is the large amount of noise from a variety of sources (e.g., pseudo, weak, or distant annotations). This work studies NER under a noisy-labeled setting with calibrated confidence estimation. Based on empirical observations that noisy and clean labels exhibit different training dynamics, we propose strategies for estimating confidence scores under local and global independence assumptions. We partially marginalize out low-confidence labels with a CRF model, and further propose a calibration method for confidence scores based on the structure of entity labels. We integrate our approach into a self-training framework to boost performance. Experiments in general noisy settings across four languages, as well as in distantly labeled settings, demonstrate the effectiveness of our method. Our code can be found at https://github.com/liukun95/Noisy-NER-Confidence-Estimation.
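To make the core idea concrete, below is a minimal sketch (not the authors' released code) of partially marginalizing out low-confidence labels in a linear-chain CRF: positions whose estimated confidence clears a threshold are pinned to their (possibly noisy) label, while low-confidence positions are summed over all tags so the model is not forced to fit them. The function names, the threshold value, and the toy tensors are illustrative assumptions.

```python
import torch

def log_partition(emissions, transitions, mask=None):
    """Forward algorithm: log-sum over the label paths allowed by `mask`.

    emissions:   (seq_len, num_tags) emission scores
    transitions: (num_tags, num_tags), transitions[i, j] = score of tag i -> tag j
    mask:        optional (seq_len, num_tags) boolean; False entries are disallowed
    """
    seq_len, num_tags = emissions.shape
    neg_inf = torch.finfo(emissions.dtype).min
    score = emissions[0].clone()
    if mask is not None:
        score = score.masked_fill(~mask[0], neg_inf)
    for t in range(1, seq_len):
        # entry [i, j] = score[i] + transitions[i, j] + emissions[t, j]
        next_score = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score = torch.logsumexp(next_score, dim=0)
        if mask is not None:
            score = score.masked_fill(~mask[t], neg_inf)
    return torch.logsumexp(score, dim=0)

def partial_marginal_nll(emissions, transitions, labels, confidence, threshold=0.7):
    """Negative log-likelihood that marginalizes out labels below `threshold`.

    Confident positions keep only their observed label; low-confidence positions
    allow every tag, so their noisy labels are summed out of the objective.
    """
    seq_len, num_tags = emissions.shape
    allowed = torch.ones(seq_len, num_tags, dtype=torch.bool)
    for t in range(seq_len):
        if confidence[t] >= threshold:
            allowed[t] = False
            allowed[t, labels[t]] = True
    log_numerator = log_partition(emissions, transitions, allowed)
    log_denominator = log_partition(emissions, transitions)
    return log_denominator - log_numerator

# Toy usage: 5 tokens, 4 tags; positions 1 and 3 fall below the threshold
# and are marginalized out rather than trained on.
torch.manual_seed(0)
emissions = torch.randn(5, 4)
transitions = torch.randn(4, 4)
labels = torch.tensor([0, 2, 1, 3, 0])
confidence = torch.tensor([0.9, 0.3, 0.8, 0.2, 0.95])
print(partial_marginal_nll(emissions, transitions, labels, confidence))
```

Note that when every position is confident this objective reduces to the standard CRF negative log-likelihood, and when no position is confident the loss is zero, so the degree of supervision varies smoothly with the confidence estimates.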