Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each method, we consider a plethora of parameter configurations, optimizing it with respect to recall and precision. For each dataset, we consider both schema-agnostic and schema-based settings. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, demonstrating the superiority of blocking workflows and string similarity joins.
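To make the idea of search-space restriction concrete, the following minimal sketch illustrates a simple schema-agnostic form of blocking, in which profiles sharing at least one token are placed in a common block and only pairs that co-occur in a block become candidates. It is not one of the methods evaluated in this work. The function names (`tokens`, `token_blocking`, `jaccard`), the toy profiles, and the 0.7 threshold are all hypothetical, and the Jaccard filter at the end is a plain pairwise check over the candidates rather than an optimized string similarity join algorithm.

```python
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Schema-agnostic signature: lowercase tokens from all attribute values."""
    return {tok for value in profile.values() for tok in str(value).lower().split()}


def token_blocking(profiles):
    """Return candidate pairs of profile ids that share at least one token."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        # Blocks with a single profile produce no pairs.
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy profiles with heterogeneous attribute names (illustrative data only).
profiles = {
    1: {"name": "Apple iPhone 12", "color": "black"},
    2: {"title": "iphone 12 apple"},
    3: {"name": "Samsung Galaxy S21"},
    4: {"title": "galaxy s21 samsung phone"},
}

all_pairs = set(combinations(sorted(profiles), 2))  # quadratic search space
candidates = token_blocking(profiles)               # restricted search space

# Keep only candidates whose token sets exceed an assumed similarity threshold.
threshold = 0.7
matches = {
    (a, b)
    for a, b in candidates
    if jaccard(tokens(profiles[a]), tokens(profiles[b])) >= threshold
}

print(len(all_pairs), len(candidates), sorted(matches))
```

On this toy input, blocking reduces the six possible comparisons to two candidate pairs. On real data the reduction and the loss of true matches both depend heavily on configuration, which is the trade-off between recall and precision discussed above.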