We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of "tough" mentions: unseen mentions (those whose tokens or token/type combination were not observed in training) and type-confusable mentions (token sequences that appear with multiple entity types in the test data). We demonstrate the usefulness of these metrics by evaluating English, Spanish, and Dutch corpora with five recent neural architectures. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models on Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed from overall precision, recall, and F1 alone.
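To make the metric definitions concrete, here is a minimal sketch (not the authors' implementation) of computing recall on the two tough subsets. It assumes mentions are represented simply as (token-tuple, type) pairs, ignoring corpus positions, and uses the token-level notion of "unseen"; the function name and data layout are hypothetical.

```python
def tough_mention_recall(train_mentions, gold, predicted):
    """Sketch of Tough Mentions Recall: recall restricted to 'tough' subsets.

    train_mentions: set of (tokens, type) pairs observed in training.
    gold, predicted: sets of (tokens, type) pairs from the test data,
    e.g. (("New", "York"), "LOC").
    """
    # Unseen mentions: token sequence never observed in training.
    seen_tokens = {tokens for tokens, _ in train_mentions}
    unseen = {m for m in gold if m[0] not in seen_tokens}

    # Type-confusable mentions: token sequences occurring with more
    # than one entity type in the test data.
    types_by_tokens = {}
    for tokens, etype in gold:
        types_by_tokens.setdefault(tokens, set()).add(etype)
    confusable = {m for m in gold if len(types_by_tokens[m[0]]) > 1}

    def recall(subset):
        # Fraction of the tough subset that the system recovered.
        return len(subset & predicted) / len(subset) if subset else None

    return {"unseen": recall(unseen), "type_confusable": recall(confusable)}
```

For example, if "Jordan" appears in the test data as both a PER and a LOC mention but only the PER reading is predicted, both tough recalls come out at 0.5 while overall scores could still look healthy.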