Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes, with table union search as the main use case. The framework features a contrastive learning method that trains column encoders from pre-trained language models in a fully unsupervised manner. Starmie's column encoder captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We use the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that supports a variety of design choices for computing the unionability score between two tables from these column scores. Empirical evaluation on real table benchmark datasets shows that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing for table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
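To make the scoring idea concrete, the following is a minimal sketch, not Starmie's implementation. It assumes column embeddings are already available as plain lists of floats (the abstract obtains them from a trained column encoder). The table-level aggregation shown here, a greedy one-to-one matching of columns averaged over the query columns, is only one illustrative design choice. The names `cosine_similarity`, `greedy_table_score`, and `top_k_tables` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Column unionability score: cosine of the angle between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_u * norm_v)


def greedy_table_score(query_cols: List[Vector], cand_cols: List[Vector]) -> float:
    """Assumed aggregation: greedily pair columns one-to-one by descending
    similarity, then average the matched scores over the query columns."""
    pairs = sorted(
        (
            (cosine_similarity(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(cand_cols)
        ),
        reverse=True,
    )
    used_query, used_cand = set(), set()
    total = 0.0
    for score, i, j in pairs:
        if score <= 0.0 or i in used_query or j in used_cand:
            continue  # skip non-positive scores and already matched columns
        used_query.add(i)
        used_cand.add(j)
        total += score
    return total / len(query_cols) if query_cols else 0.0


def top_k_tables(query_cols: List[Vector], lake: Dict[str, List[Vector]], k: int) -> List[str]:
    """Linear scan over every table in the lake (the slow baseline)."""
    scored = [(greedy_table_score(query_cols, cols), name) for name, cols in lake.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, invented purely for illustration.
    query = [[1.0, 0.0], [0.0, 1.0]]
    lake = {
        "a": [[1.0, 0.0], [0.0, 1.0]],
        "b": [[1.0, 0.0]],
        "c": [[-1.0, 0.0]],
    }
    print(top_k_tables(query, lake, k=2))
```

The scan in `top_k_tables` visits every table, which corresponds to the linear scan baseline mentioned in the abstract. An approximate nearest-neighbor index over column embeddings, such as HNSW, would take the place of that loop as the filter step, leaving the table-level scoring as the verification step.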