Entity Resolution (ER) is the task of finding records that refer to the same real-world entity. A common scenario, which we refer to as Clean-Clean ER, is resolving entities across two clean (i.e., duplicate-free) sources. In this paper, we perform an extensive empirical evaluation of 8 bipartite graph matching algorithms that take a bipartite similarity graph as input and return a set of matched entities as output. We consider a wide range of matching algorithms, including algorithms that have not previously been applied to ER or that have been evaluated only in other ER settings. We assess the relative performance of the algorithms with respect to accuracy and time efficiency over 10 established, real datasets, from which we extract more than 700 different similarity graphs. Our results provide insights into the relative performance of these algorithms and guidelines for choosing the best one, depending on the data at hand.
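To make the input/output contract concrete, the following is a minimal sketch of one possible one-to-one matcher over a bipartite similarity graph. It is a generic greedy heuristic chosen for illustration and is not claimed to be one of the 8 evaluated algorithms; the function name `greedy_bipartite_matching`, the edge-list representation, and the `threshold` parameter are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one weighted edge per candidate pair,
# (id in source A, id in source B, similarity score).
Edge = Tuple[str, str, float]


def greedy_bipartite_matching(edges: List[Edge], threshold: float = 0.5) -> Dict[str, str]:
    """Greedy one-to-one matching (illustrative assumption, not the paper's method).

    Visits edges in descending similarity order and keeps an edge only if
    neither endpoint is already matched, so each entity appears in at most
    one pair, as Clean-Clean ER requires.
    """
    matches: Dict[str, str] = {}
    matched_right = set()
    for left, right, sim in sorted(edges, key=lambda e: e[2], reverse=True):
        if sim < threshold:
            break  # all remaining edges are weaker than the threshold
        if left in matches or right in matched_right:
            continue  # one endpoint is already taken
        matches[left] = right
        matched_right.add(right)
    return matches


# Toy similarity graph between sources A (a*) and B (b*).
edges = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.85),
    ("a2", "b2", 0.4),
    ("a3", "b3", 0.3),
]
result = greedy_bipartite_matching(edges, threshold=0.5)
print(result)
```

In this toy graph, the greedy choice of (a1, b1) blocks both (a2, b1) and (a1, b2), which shows how the choice of matching algorithm and similarity threshold can change the output on the same graph.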