We present the first large-scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6,001 email threads from the Enron Email Corpus, containing 36,448 email messages and 60,383 entity coreference chains. Annotation is carried out as a two-step process that requires minimal manual effort. We run experiments to evaluate different features and the performance of four baselines on the corpus. For the task of mention identification and coreference resolution, the best performance reported is 59.2 F1, which leaves considerable room for improvement. An in-depth qualitative and quantitative error analysis examines the limitations of the baselines considered.