The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluating these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we (1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and (2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
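To make the query definition concrete, the following is a minimal sketch of the naïve baseline described above: join each candidate column with $Q$ on the key, then compute a correlation over the joined rows. It is not the sketching method proposed in the paper. Several choices here are assumptions made only for illustration: the function names `pearson` and `naive_join_correlation`, the dictionary-based table layout, the use of Pearson's coefficient as the correlation measure, unique join keys (so no aggregation is needed), and ranking by absolute correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query, candidates):
    """Explicitly join and correlate the query column with every candidate column.

    query:      dict mapping join key -> value of Q (keys assumed unique)
    candidates: dict mapping table name -> {column name -> {join key -> value}}
    Returns (table, column, correlation, join size) tuples, strongest first.
    """
    results = []
    for table, columns in candidates.items():
        for column, values in columns.items():
            # Inner join on the key: keep only keys present on both sides.
            shared = [k for k in query if k in values]
            r = pearson([query[k] for k in shared], [values[k] for k in shared])
            if r is not None:
                results.append((table, column, r, len(shared)))
    # Assumed scoring: rank by absolute correlation.
    results.sort(key=lambda t: abs(t[2]), reverse=True)
    return results


# Hypothetical toy data, not taken from the paper.
query = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
candidates = {
    "T1": {"x": {"a": 2.0, "b": 4.0, "c": 6.0, "z": 9.0}},
    "T2": {"y": {"a": 3.0, "b": 1.0, "c": 2.0, "d": 2.5}},
}
ranked = naive_join_correlation(query, candidates)
for table, column, r, size in ranked:
    print(f"{table}.{column}: r={r:.3f} over {size} joined rows")
```

Every candidate column requires a full pass over the joined rows, so the cost grows with both the number of columns in the collection and their sizes, which is the expense the abstract attributes to the naïve approach.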