In named entity recognition (NER), entity length is variable and depends on the domain or dataset. Pre-trained language models (PLMs) used to solve NER tasks tend to be biased toward dataset patterns such as length statistics, surface forms, and skewed class distributions. These biases hinder the generalization ability of PLMs, which is needed to handle the many unseen mentions that arise in real-world situations. We propose RegLER, a novel debiasing method that improves predictions for entities of varying lengths. To close the gap between evaluation and real-world settings, we evaluate PLMs on partitioned benchmark datasets containing unseen mention sets. On these partitions, RegLER yields significant improvement on long named entities, which it predicts by debiasing the conjunctions and special characters within entities. Furthermore, most NER datasets exhibit severe class imbalance, so easy-negative examples such as "The" dominate training. Our approach alleviates this skewed class distribution by reducing the influence of easy-negative examples. Extensive experiments on the biomedical and general domains demonstrate the generalization capability of our method. To facilitate reproducibility and future work, we release our code at https://github.com/minstar/RegLER.
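
To make the evaluation protocol concrete, below is a minimal Python sketch of partitioning a test set into "seen" and "unseen" mention subsets. The dataset fields and the matching rule (case-insensitive exact match of gold mention strings) are assumptions for illustration; the released code at https://github.com/minstar/RegLER defines the paper's actual partitioning.

```python
# A minimal sketch of the seen/unseen evaluation split described above.
# The example format (a "mentions" field holding gold entity strings per
# example) and the lowercase exact-match rule are assumptions.

def partition_by_mention(train_examples, test_examples):
    """Split test examples into those whose gold mentions all appeared in
    training ("seen") and those containing at least one novel mention
    ("unseen")."""
    train_mentions = {
        mention.lower()
        for ex in train_examples
        for mention in ex["mentions"]
    }
    seen, unseen = [], []
    for ex in test_examples:
        if all(m.lower() in train_mentions for m in ex["mentions"]):
            seen.append(ex)
        else:
            unseen.append(ex)
    return seen, unseen
```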
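
The easy-negative mitigation can be illustrated with a focal-loss-style token classification loss that down-weights confidently predicted negative ("O") tokens such as "The". This is a hedged sketch, not RegLER's exact regularizer; the "O" label id, the gamma value, and the omission of padding handling are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

# Sketch: reduce the influence of easy negatives in token classification.
# Negative ("O") tokens that the model already predicts confidently
# contribute less to the loss; entity tokens are left at full weight.

def easy_negative_weighted_loss(logits, labels, o_label_id=0, gamma=2.0):
    """logits: (batch, seq_len, num_labels); labels: (batch, seq_len).
    Padding handling (e.g., ignore_index) is omitted for brevity."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-token negative log-likelihood, shape (batch, seq_len).
    nll = F.nll_loss(log_probs.transpose(1, 2), labels, reduction="none")
    # Model confidence assigned to the gold label of each token.
    p_gold = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1).exp()
    # Focal-style modulation applied only to "O" tokens: the easier
    # (more confident) the negative, the smaller its loss contribution.
    weight = torch.where(
        labels == o_label_id, (1.0 - p_gold) ** gamma, torch.ones_like(p_gold)
    )
    return (weight * nll).mean()
```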