Data from different sources rarely conform to a single formatting even if they describe the same set of entities, and this raises concerns when data from multiple sources must be joined or cross-referenced. Such a formatting mismatch is unavoidable when data is gathered from various public and third-party sources. Commercial database systems are not able to perform the join when there exist differences in data representation or formatting, and manual reformatting is both time consuming and error-prone. We study the problem of efficiently joining textual data under the condition that the join columns are not formatted the same and cannot be equi-joined, but they become joinable under some transformations. The problem is challenging simply because the number of possible transformations explodes with both the length of the input and the number of rows, even if each transformation is formed using very few basic units. We show that an efficient algorithm can be developed based on the common characteristics of the joined columns, and develop one such algorithm over a rich set of basic operations that can be composed to form transformations. We compare both the coverage and the running time of our algorithm to a state-of-the-art approach, and show that our algorithm covers every transformation that is covered in the state-of-the-art approach but is a few orders of magnitude faster, as evaluated on various real and synthetic data.
翻译:不同来源的数据即使描述同一组实体,也很少符合单一格式,这引起了对多个来源的数据必须合并或相互参照的关切。当从各种公共和第三方来源收集数据时,这种格式错配是不可避免的。商业数据库系统在数据表述或格式存在差异时无法进行合并,而人工改编既耗时又容易出错。我们研究在以下条件下高效合并文本数据的问题,即合并列的格式不同,不能互换,但在某些变换中,它们变得可以互为合用。问题之所以具有挑战性,只是因为可能的变换数量随着输入长度和行数的长度而爆发,即使每次变换都使用极少的基本单位组成。我们表明,高效的算法可以根据合并列的共同特点发展,并针对可构成变换的一套丰富的基本操作开发一种这样的算法。我们比较了算法的覆盖范围和运行时间,在一些变换中,我们将算法的覆盖范围和运行时间与最先进的方法相比较,但显示,每个变换法都覆盖了每个变法的变法的进度。我们算法的进度都比较得更快。