Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the question arises of how to retrieve from the lake those datasets that might contribute to wrangling a given target. We refer to this as the problem of dataset discovery in data lakes, and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness with respect to a target table. Given a target table (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report empirical results for two forms of relatedness (unionability and joinability), comparing them with prior work where pertinent and showing significant improvements in precision, recall, target coverage, and indexing and discovery times.
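To make the idea of hashing value features into a common distance space more concrete, the following is a minimal sketch and not the method described in this paper. It assumes a single feature (the set of normalized cell values in a column), substitutes plain MinHash signatures for the paper's indexes, and ranks tables by their closest column to a target column. The names `minhash_signature`, `estimated_distance`, and `rank_tables`, as well as the dictionary layout of the lake, are hypothetical and chosen only for illustration.

```python
import hashlib


def minhash_signature(values, num_hashes=64):
    """Return a MinHash signature for the set of values in one column.

    Assumption: the only feature used is the set of lower-cased,
    whitespace-stripped string forms of the cell values.
    """
    tokens = {str(v).strip().lower() for v in values}
    if not tokens:
        raise ValueError("cannot sketch an empty column")
    signature = []
    for seed in range(num_hashes):
        # A different salt per position simulates a different hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        t.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for t in tokens
            )
        )
    return signature


def estimated_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as 1 minus the share of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def rank_tables(target_column, lake, k=2):
    """Return the names of the k tables whose best column is closest to the target.

    Hypothetical layout: `lake` maps table name -> {column name: list of values}.
    """
    target_sig = minhash_signature(target_column)
    scored = []
    for table_name, columns in lake.items():
        best = min(
            estimated_distance(target_sig, minhash_signature(col))
            for col in columns.values()
        )
        scored.append((best, table_name))
    scored.sort()
    return [name for _, name in scored[:k]]
```

Because every column is reduced to a fixed-length signature, distances between any two columns are comparable regardless of column size. A practical system would additionally bucket the signatures (for example with locality-sensitive hashing) so that the whole lake need not be scanned per query; this sketch recomputes and compares signatures linearly for brevity.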