Analysis of Distributed Deep Learning in the Cloud


December 22, 2022


Aakash Sharma, Vivek M. Bhasi, Sonali Singh, Rishabh Jain, Jashwant Raj Gunasekaran, Subrata Mitra, Mahmut Taylan Kandemir, George Kesidis, Chita R. Das

We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls - interconnect and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings for users to make an informed decision. We observe that the more expensive GPU instances may not be the most performant for all DNN models and AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads up to 90% of DNN training time and network-connected instances can suffer from up to 5x slowdown compared to training on a single instance. Further, we model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls. Finally, we propose a measurement-based recommendation model for users to lower their public cloud monetary costs for DDL, given a time budget.
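The idea of a measurement-based recommendation under a time budget can be illustrated with a minimal sketch. This is not the authors' model: the `Measurement` class, the `recommend` function, the instance labels, and all prices and training times below are hypothetical placeholders chosen for illustration. The sketch assumes each candidate configuration has already been profiled to give a wall-clock training time, then picks the cheapest configuration that finishes within the budget.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measurement:
    """One profiled configuration (all values are illustrative placeholders)."""
    instance: str            # hypothetical instance label, not a real AWS type
    hourly_cost_usd: float   # assumed on-demand price per hour
    train_hours: float       # assumed measured wall-clock training time

    @property
    def total_cost_usd(self) -> float:
        # Total bill for one training run on this configuration.
        return self.hourly_cost_usd * self.train_hours


def recommend(measurements: Sequence[Measurement],
              time_budget_hours: float) -> Optional[Measurement]:
    """Return the cheapest configuration that fits the time budget, or None."""
    feasible = [m for m in measurements if m.train_hours <= time_budget_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m.total_cost_usd)


# Placeholder measurements; these numbers are made up, not taken from the paper.
candidates = [
    Measurement("instance-a", hourly_cost_usd=3.0, train_hours=10.0),
    Measurement("instance-b", hourly_cost_usd=12.0, train_hours=4.0),
    Measurement("instance-c", hourly_cost_usd=24.0, train_hours=1.5),
]

if __name__ == "__main__":
    for budget in (12.0, 5.0, 1.0):
        choice = recommend(candidates, budget)
        label = choice.instance if choice else "no feasible configuration"
        print(f"budget {budget:>4} h -> {label}")
```

With these placeholder numbers, a tighter budget can make the configuration with the highest hourly price the cheapest feasible choice overall, because total cost depends on training time (and therefore on stalls) as well as on the hourly rate.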

