Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., TPC-H Query 1), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this paper introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system's only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google's F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
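To make the contrast between the first and third strategies concrete, the following is a minimal, single-machine sketch of in-stream aggregation over sorted input versus in-memory hash aggregation. It is illustrative only and is not the paper's new algorithm: it omits external merge sort, spilling and hash partitioning to temporary storage, and cost-based selection; the function names and the simple (key, value) row format are assumptions made for this example.

```python
# Illustrative sketches of two aggregation strategies discussed above.
# Assumptions: rows are (key, value) pairs and the aggregate is SUM.

from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def in_stream_sum(sorted_rows):
    """In-stream aggregation: requires input sorted on the grouping key.

    A single sequential pass that holds only one group in memory at a
    time and emits output already in sorted order.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def hash_sum(rows):
    """Hash aggregation: no sort order required on the input.

    Assumes all distinct keys fit in memory (a real system would spill
    partitions to temporary storage otherwise) and produces output in
    arbitrary, unsorted order.
    """
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return totals


if __name__ == "__main__":
    rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
    print(list(in_stream_sum(sorted(rows))))  # [('a', 5), ('b', 5)]
    print(dict(hash_sum(rows)))               # {'b': 5, 'a': 5}
```

The sketch also illustrates the trade-off the abstract describes: the in-stream variant is cheap and yields sorted output but presupposes sorted input, while the hash variant works on unsorted input but is bounded by available memory and produces unsorted output.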