Entity matching (EM) is the most critical step in entity resolution (ER). While current deep learning-based methods achieve impressive performance on standard EM benchmarks, their performance in real-world applications is far less satisfactory. In this paper, we argue that this gap between reality and the ideal setting stems from an unreasonable benchmark construction process, which is inconsistent with the nature of entity matching and therefore leads to biased evaluations of current EM approaches. To this end, we build a new EM corpus and re-construct EM benchmarks to challenge critical assumptions implicit in the previous benchmark construction process, by step-wise replacement of the restricted entities, balanced labels, and single-modal records in previous benchmarks with open entities, imbalanced labels, and multi-modal records in an open environment. Experimental results demonstrate that the assumptions made in the previous benchmark construction process do not hold in the open environment; they conceal the main challenges of the task and therefore significantly overestimate the current progress of entity matching. The constructed benchmarks and code are publicly released.