Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and model development, downstream improvements from applying CDCR have not yet been shown. The reason is that every CDCR system released to date was developed, trained, and tested on only a single respective corpus. This raises strong concerns about their generalizability, a must-have for downstream applications where the number of domains or event mentions is likely to exceed that found in a curated corpus. To address this issue, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus, and the Football Coreference Corpus (which we reannotate at the token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Although inferior in absolute numbers, the feature-based system performs more consistently across all corpora, whereas the neural system is hit-and-miss. Via model introspection, we find that the importance of event actions, event time, etc. for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit to the structure of the ECB+ corpus. We conclude with recommendations on how to move beyond corpus-tailored CDCR systems in the future, the most important being that evaluation on multiple CDCR corpora is essential. To facilitate future research, we release our dataset, annotation guidelines, and model implementation to the public.