We point out that common evaluation practices for cross-document coreference resolution have been unrealistically permissive in their assumed settings, yielding inflated results. We propose addressing this issue via two evaluation methodology principles. First, as in other tasks, models should be evaluated on predicted mentions rather than on gold mentions. Doing so raises a subtle issue regarding singleton coreference clusters, which we address by decoupling the evaluation of mention detection from that of coreference linking. Second, we argue that models should not exploit the synthetic topic structure of the standard ECB+ dataset, forcing them to confront the lexical ambiguity challenge, as intended by the dataset creators. We demonstrate empirically the drastic impact of our more realistic evaluation principles on a competitive model, which scores 33 F1 points lower than under the prior, lenient evaluation practices.
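To make the decoupling concrete, the following minimal Python sketch is an illustration only, not the paper's released scorer; the mention-tuple representation and the helper names (mention_detection_f1, muc, muc_f1) are hypothetical. It scores mention detection separately over predicted spans, and computes the link-based MUC metric over predicted mentions, which by construction contributes nothing for singleton clusters.

```python
# Illustrative sketch (not the authors' evaluation code): decouple mention
# detection from coreference linking, scoring on predicted mentions.
# A mention is a (doc_id, start, end) tuple; a cluster is a list of mentions.

def mention_detection_f1(gold_mentions, pred_mentions):
    """Mention detection scored on its own, so singletons still count here."""
    gold, pred = set(gold_mentions), set(pred_mentions)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def muc(key_clusters, response_clusters):
    """MUC recall of key clusters against response clusters (link-based)."""
    # Singleton clusters contain no links, so linking metrics ignore them;
    # this is why detection must be evaluated separately above.
    mention_to_response = {m: i
                           for i, c in enumerate(response_clusters)
                           for m in c}
    num = den = 0.0
    for c in key_clusters:
        if len(c) < 2:
            continue
        # Mentions missing from the response each form their own partition.
        partitions = {mention_to_response.get(m, ("missing", m)) for m in c}
        num += len(c) - len(partitions)
        den += len(c) - 1
    return num / den if den else 0.0

def muc_f1(gold_clusters, pred_clusters):
    r = muc(gold_clusters, pred_clusters)
    p = muc(pred_clusters, gold_clusters)  # precision = recall with roles swapped
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy example: one gold singleton is missed and one spurious singleton is predicted.
gold = [[("d1", 0, 2), ("d2", 5, 7)], [("d3", 1, 1)]]
pred = [[("d1", 0, 2), ("d2", 5, 7)], [("d1", 9, 10)]]
print(mention_detection_f1([m for c in gold for m in c],
                           [m for c in pred for m in c]))  # ~0.67
print(muc_f1(gold, pred))                                  # 1.0
```

In the toy example, the missed and spurious singletons lower mention-detection F1 but leave the MUC linking score untouched, illustrating why the two evaluations need to be reported separately when moving from gold to predicted mentions.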