Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals, or even to navigate data spread across data sources, each of which may easily contain thousands of tables. One common user need is to discover tables that are joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding "semantically" joinable tables: tables whose columns can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into a high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample-efficient and thus scalable to very large tables with millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
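
To make the embedding idea concrete, the following is a minimal sketch of join discovery by nearest-neighbor search over column vectors. It is not WarpGate's actual model: the hashed character-trigram encoder, the dimension `DIM`, the `sample_size` default, and the function names `embed_value`, `embed_column`, and `rank_joinable` are all assumptions made for illustration. The lowercasing and trimming step stands in for the kind of transformation that makes differently formatted columns joinable, and the row sample stands in for sample efficiency.

```python
import hashlib
import math
import random

DIM = 64  # illustrative embedding size (an assumption, not from the paper)


def embed_value(value, dim=DIM):
    """Hash the character trigrams of one cell value into a count vector."""
    vec = [0.0] * dim
    # Normalize case and surrounding whitespace so formatting variants align.
    text = f"  {str(value).strip().lower()}  "
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def embed_column(values, sample_size=100, seed=0):
    """Embed a column as the unit-normalized sum of a row sample's vectors."""
    values = list(values)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    total = [0.0] * DIM
    for v in values:
        for j, x in enumerate(embed_value(v)):
            total[j] += x
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]


def cosine(a, b):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def rank_joinable(query_values, candidates):
    """Rank candidate columns by similarity to the query column, best first."""
    q = embed_column(query_values)
    scored = [(name, cosine(q, embed_column(col))) for name, col in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical columns from two databases.
    query = ["Acme Corp", "Globex Inc", "Initech LLC"]
    candidates = {
        "crm.company_name": ["ACME CORP", "globex inc", "Initech LLC "],
        "orders.order_date": ["2021-01-03", "2021-02-14", "2021-03-09"],
    }
    for name, score in rank_joinable(query, candidates):
        print(f"{name}: {score:.3f}")
```

In this sketch the company-name column ranks first even though none of its raw values match the query column exactly, which is the sense in which the columns are semantically rather than syntactically joinable. A real system would replace the linear scan with an approximate nearest-neighbor index and the hashing encoder with a learned embedding model.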