To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in a pipelined fashion. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training), and cannot ensure the same model convergence and accuracy as training without pipelining. Synchronous pipelining has recently been proposed, which preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays training progress. Optimized pipeline planning is needed to minimize such waiting and hence the training time, but it has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for expediting synchronous pipeline-parallel training of modern large DNNs over arbitrary inter-GPU connectivity. Our algorithmic framework comprises two components: a pipeline partition and device mapping algorithm, and a pipeline scheduler that decides the processing order of microbatches over the partitions, which together minimize the per-iteration training time. We conduct thorough theoretical analysis, extensive testbed experiments and trace-driven simulation, and demonstrate that our scheme can accelerate training by up to 157% compared with state-of-the-art designs.