When combining data from multiple sources, inconsistent data complicates the production of a coherent result. In this paper, we introduce a new type of constraint called edit rules under a partial key (EPKs). These constraints can model inconsistencies both within and between sources, but in a loosely coupled manner. We show that the well-known set cover methodology can be adapted to the setting of EPKs, which yields an efficient algorithm for finding minimal-cost repairs of sources. This algorithm is implemented in a repair engine called Parker. Empirical results show that Parker is several orders of magnitude faster than state-of-the-art repair tools, while the quality of its repairs, measured by $F_1$-score, ranges from comparable to better than that of these tools.
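To give a feel for the set cover view of repair mentioned above, the following is a minimal sketch and not Parker's actual algorithm. It assumes that each violated rule is described by the set of attributes it involves, that each attribute has a positive cost of being changed, and that a repair must change at least one attribute of every violated rule. The function name `greedy_cover`, the rule names, and the attribute names are all hypothetical. The greedy strategy shown is a standard approximation for weighted set cover and does not guarantee a minimal-cost solution.

```python
def greedy_cover(violated_rules, attribute_costs):
    """Greedy weighted set cover (illustrative sketch only).

    violated_rules: dict mapping a rule name to the set of attributes it involves.
    attribute_costs: dict mapping an attribute to a positive cost of changing it.
    Returns a list of attributes such that every violated rule involves
    at least one of them. The result is an approximation, not necessarily
    a minimal-cost cover.
    """
    uncovered = set(violated_rules)
    chosen = []
    while uncovered:
        best_attr, best_ratio = None, float("inf")
        # Iterate in sorted order so that ties are broken deterministically.
        for attr in sorted(attribute_costs):
            newly_covered = sum(
                1 for rule in uncovered if attr in violated_rules[rule]
            )
            if newly_covered == 0:
                continue
            # Cost paid per newly covered rule.
            ratio = attribute_costs[attr] / newly_covered
            if ratio < best_ratio:
                best_attr, best_ratio = attr, ratio
        if best_attr is None:
            raise ValueError("a violated rule involves no attribute with a known cost")
        chosen.append(best_attr)
        uncovered = {
            rule for rule in uncovered if best_attr not in violated_rules[rule]
        }
    return chosen


# Hypothetical example: three violated rules over four attributes, unit costs.
rules = {
    "r1": {"age", "marital_status"},
    "r2": {"age", "employment"},
    "r3": {"employment", "income"},
}
costs = {"age": 1.0, "marital_status": 1.0, "employment": 1.0, "income": 1.0}
repair = greedy_cover(rules, costs)
```

In this toy instance, changing `age` and `employment` touches every violated rule. Non-uniform costs can express that some attributes are more trustworthy than others and should be changed only as a last resort.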