Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on optimizing these frameworks, including their storage management. The shift to cloud computing requires optimizing across all pipelines running concurrently on a cluster. In this paper, we look at one specific instance of this problem: placement of I/O-intensive temporary intermediate data on SSD and HDD. Efficient data placement is challenging since I/O density is usually unknown at the time data needs to be placed. Additionally, external factors such as load variability, job preemption, or job priorities can affect job completion times, which in turn affect the I/O density of the temporary files in the workload. In this paper, we envision that machine learning can be used to solve this problem. We analyze production logs from Google's data centers for a range of data processing pipelines. Our analysis shows that I/O density may be predictable. This suggests that learning-based strategies, if crafted carefully, could extract features that predict the I/O density of temporary files involved in various transformations, which could be used to improve the efficiency of storage management in data processing pipelines.
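To make the envisioned approach concrete, the following is a minimal, hypothetical sketch of what such a learning-based placement policy could look like: a regressor predicts a temporary file's I/O density from features plausibly available at creation time, and the prediction drives an SSD-vs-HDD decision. The feature names, the synthetic training data, and the placement threshold are illustrative assumptions only, not the method or data described in this paper.

```python
# Hypothetical sketch: predict I/O density of a temporary file at creation
# time and choose a storage tier. All features, data, and thresholds are
# assumptions for illustration, not the paper's actual model or workload.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic training set: each row describes a temp file by
# (transform_id, log2(input bytes), pipeline_priority); the label is the
# observed I/O density (e.g., bytes read+written per byte stored).
n = 2000
transform_id = rng.integers(0, 20, size=n)
input_bytes_log = rng.uniform(20, 40, size=n)
pipeline_priority = rng.integers(0, 3, size=n)
X = np.column_stack([transform_id, input_bytes_log, pipeline_priority])

# Assumed ground truth: density depends mostly on the producing transform,
# with noise standing in for load variability and preemption effects.
io_density = np.exp(0.2 * transform_id) * (1 + 0.1 * pipeline_priority)
io_density *= rng.lognormal(mean=0.0, sigma=0.3, size=n)

model = GradientBoostingRegressor().fit(X, io_density)

def choose_tier(features, ssd_threshold=10.0):
    """Place a new temp file on SSD only if its predicted I/O density
    exceeds an (assumed) threshold; otherwise default to HDD."""
    predicted = model.predict(np.asarray(features).reshape(1, -1))[0]
    return "SSD" if predicted > ssd_threshold else "HDD"

print(choose_tier([17, 33.0, 2]))  # high-density transform: likely "SSD"
print(choose_tier([2, 25.0, 0]))   # low-density transform: likely "HDD"
```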