Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets

Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with strong variance in word choice and more loosely related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+ and NewsWCL50 using multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains at a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that the two CDCR datasets pose, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
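To make the baseline mentioned above concrete, the following is a minimal sketch of a per-chain count of unique word forms, in the spirit of the "number of unique lemmas" measure. It is not the PD metric, whose definition is not given in this abstract. The function name `unique_token_counts`, the chain identifiers, and the toy mentions are all hypothetical, and lowercased whitespace tokens stand in for lemmas, which a real pipeline would obtain from a lemmatizer.

```python
from collections import defaultdict


def unique_token_counts(mentions):
    """Map each chain id to the number of distinct lowercased tokens in its mentions.

    Assumption: lowercased whitespace tokens approximate lemmas.
    """
    vocab = defaultdict(set)
    for chain_id, text in mentions:
        vocab[chain_id].update(text.lower().split())
    return {chain_id: len(tokens) for chain_id, tokens in vocab.items()}


# Hypothetical toy mentions, not taken from ECB+ or NewsWCL50.
mentions = [
    ("chain_a", "the earthquake"),
    ("chain_a", "the quake"),
    ("chain_a", "the earthquake"),
    ("chain_b", "the immigrants"),
    ("chain_b", "undocumented migrants"),
    ("chain_b", "people seeking asylum"),
]

counts = unique_token_counts(mentions)
print(counts)
```

In this toy input both chains have three mentions, but the second chain uses more distinct word forms. A count like this ignores how often each phrasing occurs and how mentions differ as whole phrases, which is the kind of coarseness a more detailed diversity measure would presumably address.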
