To address what we believe is a looming crisis of unreproducible evaluation for named entity recognition (NER) tasks, we present guidelines for reproducible evaluation. The guidelines are extremely simple, focusing on transparency about how chunks are encoded and scored, yet very few papers currently being published fully comply with them. We demonstrate that, despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can change scores by amounts that are both noticeable in magnitude and statistically significant. We provide SeqScore, an open-source toolkit that addresses many of the issues that cause replication failures and makes following our guidelines easy.
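To make the point about unreported scoring differences concrete, the following minimal sketch shows how one such choice, the handling of invalid BIO label sequences, can change a chunk-level F1 score. It is an illustration under stated assumptions, not SeqScore's implementation: the function names `decode_chunks` and `f1`, the policy names `"begin"` and `"discard"`, and the toy labels are all hypothetical.

```python
from typing import List, Set, Tuple

# A chunk is (entity type, start index, end index exclusive).
Chunk = Tuple[str, int, int]


def decode_chunks(labels: List[str], repair: str) -> Set[Chunk]:
    """Decode BIO labels into chunks under a (hypothetical) repair policy.

    repair="begin":   an I-X label that does not continue an X chunk is
                      treated as if it began a new chunk.
    repair="discard": a chunk that starts with such an I-X label is dropped.
    """
    if repair not in ("begin", "discard"):
        raise ValueError(f"unknown repair policy: {repair}")

    chunks: Set[Chunk] = set()
    cur_type = None   # type of the currently open chunk, if any
    start = 0         # start index of the currently open chunk
    valid = True      # whether the open chunk started with a B- label

    # A trailing "O" sentinel closes any chunk still open at the end.
    for i, label in enumerate(labels + ["O"]):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and etype == cur_type:
            continue  # label extends the open chunk
        # Otherwise, close the open chunk (if any).
        if cur_type is not None:
            if valid or repair == "begin":
                chunks.add((cur_type, start, i))
            cur_type = None
        # Open a new chunk on B-, or on an I- that continues nothing.
        if prefix in ("B", "I"):
            cur_type, start = etype, i
            valid = prefix == "B"
    return chunks


def f1(gold: Set[Chunk], pred: Set[Chunk]) -> float:
    """Exact-match chunk-level F1."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_labels = ["B-PER", "I-PER", "O", "B-LOC"]
pred_labels = ["I-PER", "I-PER", "O", "B-LOC"]  # invalid: starts with I-

gold = decode_chunks(gold_labels, "begin")  # gold is valid, so policy is moot
for policy in ("begin", "discard"):
    pred = decode_chunks(pred_labels, policy)
    print(f"{policy}: F1 = {f1(gold, pred):.3f}")
# begin: F1 = 1.000
# discard: F1 = 0.667
```

The same predicted labels score 1.000 under one policy and 0.667 under the other, which is why a paper that does not report how invalid sequences are handled cannot be reliably compared against.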