Finding joinable tables in data lakes is a key procedure in many applications, such as data integration, data augmentation, data analysis, and data markets. Traditional approaches that find equi-joinable tables cannot handle misspellings or differing formats, nor do they capture semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on those vectors, thereby addressing the limitations of equi-join approaches and identifying more meaningful results. To find joinable tables under similarity efficiently, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case where the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and that the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method.
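The idea of joining columns under a similarity predicate with pivot-based filtering can be illustrated with a minimal sketch. This is not the PEXESO implementation. It assumes Euclidean distance, toy 2-D vectors in place of real embeddings, hand-picked pivots, and a hypothetical joinability score defined as the fraction of query values that have at least one target value within distance `tau`. All function names are illustrative only. The filter relies on the triangle inequality: for any pivot p, |d(q, p) - d(x, p)| <= d(q, x), so a candidate pair can be discarded without computing d(q, x) whenever that lower bound exceeds `tau`.

```python
import math


def pivot_distances(vectors, pivots):
    """Precompute the distance from every vector to every pivot."""
    return [[math.dist(v, p) for p in pivots] for v in vectors]


def has_match(q, q_piv, target, target_piv, tau):
    """Return True if some target vector lies within distance tau of q."""
    for x, x_piv in zip(target, target_piv):
        # Filter step: the triangle inequality gives a lower bound on d(q, x).
        # If the bound exceeds tau for any pivot, the pair cannot match.
        if any(abs(a - b) > tau for a, b in zip(q_piv, x_piv)):
            continue
        # Verify step: compute the exact distance only for surviving pairs.
        if math.dist(q, x) <= tau:
            return True
    return False


def joinability(query, target, pivots, tau):
    """Fraction of query vectors with a match in the target column.

    This score is an assumption made for illustration only.
    """
    if not query:
        return 0.0
    q_pivs = pivot_distances(query, pivots)
    t_pivs = pivot_distances(target, pivots)
    matched = sum(
        has_match(q, qp, target, t_pivs, tau) for q, qp in zip(query, q_pivs)
    )
    return matched / len(query)


# Toy "embedded" columns; real embeddings would be high-dimensional.
query = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
target = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]

score = joinability(query, target, pivots, tau=0.2)
print(score)
```

In this sketch the pivot distances are computed once per column, so the filter costs only a few subtractions per candidate pair, while the exact distance is computed only for pairs that survive it. How pivots are chosen and how candidates are grouped into blocks are left out here.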