Blocking is a mechanism for improving the efficiency of Entity Resolution (ER) by quickly pruning non-matching record pairs. However, depending on the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, so that they help ER scale but can hurt its effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology, progressive blocking (pBlocking), that enables both efficient and effective ER and works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only once the output of ER starts to become available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we reach the desired trade-off, using a limited amount of ER results as guidance in every round. We formally prove that pBlocking converges efficiently ($O(n \log^2 n)$ time complexity, where $n$ is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60%, respectively, improving the overall F-score of the entire ER process by up to 60%.
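To make the feedback loop concrete, the following is a minimal toy sketch, not the pBlocking algorithm itself. It assumes token blocking as the bootstrap method, an inverse-block-size starting score, and the observed match rate as the refined block score; the names `build_blocks`, `progressive_blocking`, `is_match`, `rounds`, and `budget` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records):
    """Standard token blocking: one block per token, holding record ids."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    # Blocks with a single record yield no candidate pairs.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def progressive_blocking(records, is_match, rounds=3, budget=2):
    """Toy feedback loop: re-score blocks from the partial ER output so far."""
    blocks = build_blocks(records)
    # Bootstrap score (assumed heuristic): smaller blocks are more selective.
    scores = {tok: 1.0 / len(ids) for tok, ids in blocks.items()}
    results = {}  # pair -> bool, the partial ER output

    for _ in range(rounds):
        # Rank unresolved pairs by the best score of any block containing them.
        pair_score = {}
        for tok, ids in blocks.items():
            for pair in combinations(sorted(ids), 2):
                if pair not in results:
                    pair_score[pair] = max(pair_score.get(pair, 0.0), scores[tok])
        if not pair_score:
            break
        # Spend a limited budget of ER comparisons on the top-ranked pairs.
        ranked = sorted(pair_score, key=lambda p: (-pair_score[p], p))
        for pair in ranked[:budget]:
            results[pair] = is_match(*pair)
        # Feedback: a block's new score is its observed match rate.
        for tok, ids in blocks.items():
            seen = [results[p] for p in combinations(sorted(ids), 2) if p in results]
            if seen:
                scores[tok] = sum(seen) / len(seen)

    matches = {pair for pair, matched in results.items() if matched}
    return matches, scores


# Hypothetical data: `truth` stands in for the ER matcher's decisions.
records = {
    1: "apple iphone 12",
    2: "iphone 12 apple",
    3: "apple macbook pro",
    4: "samsung galaxy s21",
    5: "galaxy s21 samsung",
}
truth = {1: "a", 2: "a", 3: "b", 4: "c", 5: "c"}
matches, scores = progressive_blocking(
    records, lambda a, b: truth[a] == truth[b], rounds=3, budget=2
)
```

In this sketch the broad `apple` block ends up with a lower score than the narrow `iphone` block once comparisons reveal that most of its pairs do not match, which is the kind of data-driven re-scoring the abstract describes. The complexity bound stated above applies to the paper's method, not to this quadratic toy.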