We study the problem of matched record clustering in unsupervised entity resolution. We build upon a state-of-the-art probabilistic framework named the Data Washing Machine (DWM). We introduce a graph-based hierarchical two-step record clustering method (GDWM). It first identifies large connected components, which we call soft clusters, in the matched record pairs using the graph-based transitive closure algorithm used in the DWM. It then breaks the discovered soft clusters down hierarchically into more precise entity clusters using an adapted graph-based modularity optimization method. Our approach provides several advantages over the original implementation of the DWM, mainly a significant speed-up, increased precision, and overall increased F1 scores. We demonstrate the efficacy of our approach through experiments on multiple synthetic datasets. Our results also provide evidence of the utility of graph theory-based algorithms despite their scarcity in the literature on unsupervised entity resolution.
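The two-step idea can be illustrated with a minimal sketch. It is not the GDWM implementation: the function names (`soft_clusters`, `split_by_modularity`, `entity_clusters`) and the record identifiers are hypothetical, step 1 uses a plain union-find as a stand-in for the DWM transitive closure, and step 2 uses a naive greedy merge that maximizes Newman modularity as a stand-in for the adapted modularity optimization method, whose details the abstract does not give.

```python
from collections import defaultdict
from itertools import combinations


def soft_clusters(pairs):
    """Step 1 (assumed stand-in): transitive closure of matched pairs via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def split_by_modularity(nodes, pairs):
    """Step 2 (assumed stand-in): greedy agglomeration that only merges
    communities while the merge increases modularity."""
    edges = {frozenset(p) for p in pairs
             if p[0] in nodes and p[1] in nodes and p[0] != p[1]}
    m = len(edges)
    if m == 0:
        return [{n} for n in sorted(nodes)]

    degree = defaultdict(int)
    for e in edges:
        for n in e:
            degree[n] += 1

    # Start from singletons; sorting keeps tie-breaking deterministic.
    communities = [{n} for n in sorted(nodes)]
    while True:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(communities)), 2):
            a, b = communities[i], communities[j]
            # Number of edges with one endpoint in a and the other in b.
            e_ab = sum(1 for e in edges if len(e & a) == 1 and len(e & b) == 1)
            d_a = sum(degree[n] for n in a)
            d_b = sum(degree[n] for n in b)
            # Modularity change from merging a and b.
            gain = e_ab / m - d_a * d_b / (2 * m * m)
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            return communities
        i, j = best_pair
        communities[i] |= communities[j]
        del communities[j]


def entity_clusters(pairs):
    """Run step 1, then refine each soft cluster with step 2."""
    clusters = []
    for component in soft_clusters(pairs):
        clusters.extend(split_by_modularity(component, pairs))
    return clusters


if __name__ == "__main__":
    # Two triangles joined by one (possibly false-positive) match, plus a separate pair.
    pairs = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
             ("r4", "r5"), ("r5", "r6"), ("r4", "r6"),
             ("r3", "r4"), ("r7", "r8")]
    print(sorted(sorted(c) for c in entity_clusters(pairs)))
```

In this toy input, transitive closure alone places r1 through r6 in a single soft cluster because of the one bridging match between r3 and r4. The modularity step separates the two triangles again, which is the kind of precision gain the second step is meant to deliver. The quadratic pair scan is chosen for readability and would not scale to large soft clusters.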