The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem is challenging because (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually no unified data model that captures the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables. The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even when their values are represented differently. We propose a data-driven learning approach that casts the problem as an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning to learn an embedding model that takes into account the indexing/search data structure and places columns with semantically similar values close together in embedding space while pushing apart columns with semantically dissimilar values. We then find union-able tables based on the similarities between their constituent columns in embedding space. On a real-world data lake, we demonstrate that our best-performing model achieves significant improvements in precision ($16\% \uparrow$), recall ($17\% \uparrow$), and query response time (7x faster) compared to the state-of-the-art.
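To make the retrieval step concrete, the following is a minimal sketch of how tables might be ranked once column embeddings are available. It is an illustration under stated assumptions, not the method of this work: the function names (`cosine`, `table_score`, `rank_tables`), the greedy one-to-one column matching, the `threshold` value, and the toy two-dimensional embeddings are all hypothetical, and a brute-force scan stands in for the indexing/search data structure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def table_score(query_cols, cand_cols, threshold=0.8):
    """Score a candidate table against a query table.

    Assumption: columns are matched greedily and one-to-one, most similar
    pair first, and only pairs with similarity >= threshold count.
    The score is the summed similarity divided by the number of query columns.
    """
    pairs = sorted(
        ((cosine(q, c), i, j)
         for i, q in enumerate(query_cols)
         for j, c in enumerate(cand_cols)),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for sim, i, j in pairs:
        if sim < threshold:
            break  # pairs are sorted, so no later pair can qualify
        if i in used_q or j in used_c:
            continue  # each column takes part in at most one match
        used_q.add(i)
        used_c.add(j)
        total += sim
    return total / len(query_cols) if query_cols else 0.0

def rank_tables(query_cols, lake, k=2):
    """Return the k best (table_name, score) pairs by brute-force scan."""
    scored = [(table_score(query_cols, cols), name) for name, cols in lake.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))  # score descending, then name
    return [(name, score) for score, name in scored[:k]]

# Toy data: each table is a list of (hypothetical) column embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
lake = {
    "cities": [[0.9, 0.1], [0.1, 0.9]],     # both columns close to the query's
    "partial": [[1.0, 0.0]],                # only one matching column
    "products": [[-1.0, 0.0], [0.7, 0.7]],  # no column above the threshold
}

for name, score in rank_tables(query, lake, k=3):
    print(f"{name}: {score:.3f}")
```

In this toy setup, `cities` ranks first because both of its columns match a query column, `partial` receives half credit for its single match, and `products` scores zero. At data-lake scale, the exhaustive scan would be replaced by an approximate nearest-neighbor index over the column embeddings.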