学会成为统计员:不同价值数目的计算者 (Learning to be a Statistician: Learned Estimator for Number of Distinct Values) - 专知论文

会员服务 ·

0

估计/估计量 · 学成 · Performer · MoDELS · 训练数据 ·

2022 年 2 月 6 日

Learning to be a Statistician: Learned Estimator for Number of Distinct Values

翻译：学会成为统计员:不同价值数目的计算者

Renzhi Wu,Bolin Ding,Xu Chu,Zhewei Wei,Xiening Dai,Tao Guan,Jingren Zhou

from arxiv, Published at International Conference on Very Large Data Bases (VLDB) 2022

Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.

翻译：在这项工作中,我们的重点是如何从随机(在线/离线)样本中得出准确的NDV估算。这种高效估算对于即使一次扫描数据也令人望而却步的任务至关重要。现有基于抽样的估算器通常依赖超常或假设,而且由于数据假设很容易破碎,不同数据集的估算器没有强大的性能。另一方面,由于设计结构复杂,从诸如最大可能性估算等原则性公式中得出一个估算器非常困难。我们提议在监督的学习框架内制定NDV估算任务,目的是学习模型作为测算器。为此,我们需要回答几个问题:一) 如何使所学模型工作量具有可比性;二(二) 如何获得培训数据;三) 如何进行模型培训。我们从一个有原则的公式下得出一个估算器框架,即准确的估算值是9。我们从该模型的准确性估算值中得出一个估算器的估算器,从一个受监督的模型中得出其估算任务,以学习模型作为测算的模型。

0

相关内容

估计/估计量

估计/估计量

计算机科学课程与视频课件合集，Computer Science courses with video lectures

计算机科学课程与视频课件合集，Computer Science courses with video lectures

专知会员服务

37+阅读 · 2022年1月24日

哥伦比亚大学最新《机器学习》课程，Fall-B 2020 (Machine Learning)

专知会员服务

39+阅读 · 2020年11月3日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

MIT-深度学习Deep Learning State of the Art in 2020，87页ppt

MIT-深度学习Deep Learning State of the Art in 2020，87页ppt

专知会员服务

62+阅读 · 2020年2月17日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

中国图象图形学学会CSIG

0+阅读 · 2021年11月9日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium2

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium2

中国图象图形学学会CSIG

0+阅读 · 2021年11月8日

【ICIG2021】Latest News & Announcements of the Plenary Talk1

【ICIG2021】Latest News & Announcements of the Plenary Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年11月1日

【ICIG2021】Latest News & Announcements of the Industry Talk1

【ICIG2021】Latest News & Announcements of the Industry Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年7月28日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Al-Cu-Mg-Mn-Zn体系关键四元系富Al端相图、液相面投影图的实验和热力学优化研究

国家自然科学基金

0+阅读 · 2014年12月31日

有限域上多项式的p-进与T-进指数和

国家自然科学基金

0+阅读 · 2013年12月31日

基于Universum学习的降维方法研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于计算智能的起重机金属结构可靠性稳健优化设计研究

国家自然科学基金

0+阅读 · 2012年12月31日

压缩传感中CS矩阵的构造理论与信号重构的快速算法

国家自然科学基金

0+阅读 · 2012年12月31日

Witten Laplacian的特征值及与其相关的Ricci Soliton研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于压缩感知的稀疏阵列MIMO-SAR成像及动目标检测

国家自然科学基金

0+阅读 · 2012年12月31日

关于图顶点划分的 Thomassen 猜想

国家自然科学基金

0+阅读 · 2011年12月31日

赋值理论与几何不等式的研究

国家自然科学基金

1+阅读 · 2011年12月31日

基于概率图模型的不确定数据管理

国家自然科学基金

2+阅读 · 2009年12月31日

Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Arxiv

0+阅读 · 2022年4月19日

A Statistical Decision-Theoretical Perspective on the Two-Stage Approach to Parameter Estimation

Arxiv

0+阅读 · 2022年4月15日

Imposing Consistency for Optical Flow Estimation

Arxiv

0+阅读 · 2022年4月14日

Testing distributional assumptions of learning algorithms

Arxiv

0+阅读 · 2022年4月14日

Bayesian Deep Learning for Graphs

Arxiv

23+阅读 · 2022年2月24日

The Principles of Deep Learning Theory

Arxiv

66+阅读 · 2021年6月18日

The Causal Learning of Retail Delinquency

Arxiv

15+阅读 · 2020年12月17日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

18+阅读 · 2019年10月30日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

Generative Adversarial Networks and Probabilistic Graph Models for Hyperspectral Image Classification

Arxiv

11+阅读 · 2018年2月10日

VIP会员

文章信息

相关主题

估计/估计量

相关VIP内容

计算机科学课程与视频课件合集，Computer Science courses with video lectures

计算机科学课程与视频课件合集，Computer Science courses with video lectures

专知会员服务

37+阅读 · 2022年1月24日

哥伦比亚大学最新《机器学习》课程，Fall-B 2020 (Machine Learning)

专知会员服务

39+阅读 · 2020年11月3日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

MIT-深度学习Deep Learning State of the Art in 2020，87页ppt

MIT-深度学习Deep Learning State of the Art in 2020，87页ppt

专知会员服务

62+阅读 · 2020年2月17日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《俄乌战争背景下俄罗斯的战略性海军分析（2022-2025年）》最新100页报告

【斯坦福博士论文】数据、决策与依赖：构建可信人工智能的挑战

人工智能时代背景下的未来海战

接触战中的无人机优势：美军旅级部队面临的小型无人机系统挑战与调整

相关资讯

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium9

中国图象图形学学会CSIG

0+阅读 · 2021年12月17日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium3

中国图象图形学学会CSIG

0+阅读 · 2021年11月9日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium2

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium2

中国图象图形学学会CSIG

0+阅读 · 2021年11月8日

【ICIG2021】Latest News & Announcements of the Plenary Talk1

【ICIG2021】Latest News & Announcements of the Plenary Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年11月1日

【ICIG2021】Latest News & Announcements of the Industry Talk1

【ICIG2021】Latest News & Announcements of the Industry Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年7月28日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Arxiv

0+阅读 · 2022年4月19日

A Statistical Decision-Theoretical Perspective on the Two-Stage Approach to Parameter Estimation

Arxiv

0+阅读 · 2022年4月15日

Imposing Consistency for Optical Flow Estimation

Arxiv

0+阅读 · 2022年4月14日

Testing distributional assumptions of learning algorithms

Arxiv

0+阅读 · 2022年4月14日

Bayesian Deep Learning for Graphs

Arxiv

23+阅读 · 2022年2月24日

The Principles of Deep Learning Theory

Arxiv

66+阅读 · 2021年6月18日

The Causal Learning of Retail Delinquency

Arxiv

15+阅读 · 2020年12月17日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

18+阅读 · 2019年10月30日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

Generative Adversarial Networks and Probabilistic Graph Models for Hyperspectral Image Classification

Arxiv

11+阅读 · 2018年2月10日

相关基金

Al-Cu-Mg-Mn-Zn体系关键四元系富Al端相图、液相面投影图的实验和热力学优化研究

国家自然科学基金

0+阅读 · 2014年12月31日

有限域上多项式的p-进与T-进指数和

国家自然科学基金

0+阅读 · 2013年12月31日

基于Universum学习的降维方法研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于计算智能的起重机金属结构可靠性稳健优化设计研究

国家自然科学基金

0+阅读 · 2012年12月31日

压缩传感中CS矩阵的构造理论与信号重构的快速算法

国家自然科学基金

0+阅读 · 2012年12月31日

Witten Laplacian的特征值及与其相关的Ricci Soliton研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于压缩感知的稀疏阵列MIMO-SAR成像及动目标检测

国家自然科学基金

0+阅读 · 2012年12月31日

关于图顶点划分的 Thomassen 猜想

国家自然科学基金

0+阅读 · 2011年12月31日

赋值理论与几何不等式的研究

国家自然科学基金

1+阅读 · 2011年12月31日

基于概率图模型的不确定数据管理

国家自然科学基金

2+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员