Correlation Sketches for Approximate Join-Correlation Queries - 专知 (Zhuanzhi) Papers

Topics: correlation coefficient · column · data augmentation · approximation · estimation/estimator

April 7, 2021

Correlation Sketches for Approximate Join-Correlation Queries

Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

From arXiv; Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21)

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
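To make the contrast in the abstract concrete, the following Python sketch compares the naïve join-then-correlate baseline with a small key-sampled estimate. It is only an illustration under stated assumptions: the abstract does not describe how the sketches are built, so the "keep the keys with the smallest hash values" sampling used here is a generic stand-in and not necessarily the paper's construction. All function names (`exact_join_correlation`, `build_sketch`, `estimated_join_correlation`) are hypothetical, and each column is modelled as a plain `{join_key: numeric_value}` dictionary.

```python
import hashlib
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def exact_join_correlation(query, candidate):
    """Naive baseline: join two {key: value} columns on key, then correlate."""
    keys = sorted(query.keys() & candidate.keys())
    return pearson([query[k] for k in keys], [candidate[k] for k in keys])


def key_hash(key):
    """Deterministic hash, so the same key is ranked identically in every table."""
    return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")


def build_sketch(column, size):
    """ASSUMPTION: keep the `size` entries whose keys have the smallest hashes."""
    kept = sorted(column, key=key_hash)[:size]
    return {k: column[k] for k in kept}


def estimated_join_correlation(sketch_q, sketch_c):
    """Estimate the correlation using only keys that survive in both sketches."""
    return exact_join_correlation(sketch_q, sketch_c)


# Toy data: the candidate column is a linear function of the query column
# on the keys they share (every second key).
query = {f"k{i}": float(i) for i in range(1000)}
candidate = {f"k{i}": 2.0 * i + 1.0 for i in range(0, 1000, 2)}

exact = exact_join_correlation(query, candidate)
approx = estimated_join_correlation(
    build_sketch(query, 64), build_sketch(candidate, 64)
)
```

Because both sketches rank keys with the same hash function, keys that are shared by the two tables tend to be retained in both samples, so the sketches can be joined with each other without touching the full tables. How accurate such an estimate is, and how to score and rank candidates by it, is what the paper evaluates; this toy example makes no claim about those results.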
