Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events must be identified and clustered across a collection of documents. CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not yet been shown. We observe that every CDCR system to date was developed, trained, and tested on only a single corpus each. This raises strong concerns about their generalizability, a must-have for downstream applications where the range of domains and event mentions is likely to exceed that found in a curated corpus. To investigate this concern, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus, and the Football Coreference Corpus (which we reannotate at the token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Although inferior in absolute numbers, the feature-based system performs more consistently across all corpora, whereas the neural system is hit-and-miss. Via model introspection, we find that the importance of event actions, event time, etc. for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit to the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future, the most important being that evaluation on multiple CDCR corpora is essential. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public.