Named Entity Recognition (NER) is an important task in natural language processing. However, traditional supervised NER requires large-scale annotated datasets. Distant supervision has been proposed to alleviate this massive demand for data, but datasets constructed in this way are extremely noisy and suffer from a serious unlabeled entity problem. The cross entropy (CE) loss function is highly sensitive to unlabeled data, leading to severe performance degradation. As an alternative, we propose a new loss function called NRCES to cope with this problem. A sigmoid term is used to mitigate the negative impact of noise. In addition, we balance the convergence and noise tolerance of the model according to the samples and the training process. Experiments on synthetic and real-world datasets demonstrate that our approach is strongly robust under a severe unlabeled entity problem, achieving new state-of-the-art results on real-world datasets.
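To make the idea concrete, here is a minimal sketch of a sigmoid-damped cross entropy in the spirit described above. The exact NRCES formulation is not given in this abstract, so the weighting scheme, the threshold `tau`, and the steepness `k` below are illustrative assumptions: negative ("O") tokens whose predicted probability contradicts their label are treated as likely unlabeled entities and their CE term is shrunk by a sigmoid factor, while annotated positives keep full weight.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_damped_ce(probs, labels, tau=0.5, k=10.0):
    """Hypothetical sketch (not the paper's exact NRCES formula).

    probs:  model probability assigned to each token's given label, shape (n,)
    labels: 1 for annotated entity tokens, 0 for 'O' tokens (which may be
            unlabeled entities, i.e. false negatives under distant supervision)
    tau, k: assumed hyperparameters controlling where and how sharply the
            sigmoid weight decays for suspicious negatives
    """
    ce = -np.log(np.clip(probs, 1e-12, 1.0))       # per-token cross entropy
    weight = np.where(
        labels == 1,
        1.0,                                       # trust annotated positives
        sigmoid(k * (probs - tau)),                # damp low-confidence 'O' tokens
    )
    return float(np.mean(weight * ce))
```

A token labeled "O" on which the model strongly disagrees (low probability for "O") receives a near-zero weight, so a single mislabeled entity no longer dominates the gradient the way it does under plain CE.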