Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, it adds communication overhead between the NPUs, which must synchronize gradients and/or activations depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: with the collective-communication scheduling techniques used on today's systems, it is difficult to keep all network dimensions busy and to maximize network bandwidth (BW) utilization in such a hybrid environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72X (2.70X max), and improves the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.
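To make the load-balancing idea concrete, the following is a minimal toy sketch, not the scheduling algorithm from the paper. It assumes a hierarchical All-Reduce in which each chunk is reduce-scattered one dimension at a time, so the data entering each later dimension shrinks by the size of the previous one. It also assumes a simple greedy rule: each chunk visits the currently least-loaded dimension first. The function names (`stage_times`, `schedule`), the traffic model, and the greedy rule are all illustrative assumptions; latency, the all-gather phase, and inter-chunk dependencies are ignored.

```python
def stage_times(chunk_bytes, order, sizes, bandwidths):
    """Time one chunk spends on each dimension during reduce-scatter.

    Assumed model: a ring reduce-scatter over p NPUs sends (p - 1) / p of
    the current data per NPU, then leaves 1 / p of it for the next stage.
    """
    times = {}
    remaining = float(chunk_bytes)
    for d in order:
        p = sizes[d]
        sent = remaining * (p - 1) / p
        times[d] = sent / bandwidths[d]
        remaining /= p
    return times


def schedule(num_chunks, chunk_bytes, sizes, bandwidths, balanced):
    """Return the accumulated busy time per dimension.

    balanced=False: every chunk uses the fixed order dim 0, dim 1, ...
    balanced=True:  each chunk starts on the least-loaded dimension
                    (hypothetical greedy rule, for illustration only).
    """
    dims = range(len(sizes))
    load = [0.0] * len(sizes)
    for _ in range(num_chunks):
        if balanced:
            order = sorted(dims, key=lambda d: load[d])
        else:
            order = list(dims)
        for d, t in stage_times(chunk_bytes, order, sizes, bandwidths).items():
            load[d] += t
    return load


if __name__ == "__main__":
    sizes = [4, 4]              # NPUs per dimension (assumed values)
    bandwidths = [100.0, 100.0]  # bytes per time unit (assumed values)
    print("fixed order:", schedule(8, 1600, sizes, bandwidths, balanced=False))
    print("balanced:   ", schedule(8, 1600, sizes, bandwidths, balanced=True))
```

With these assumed inputs, the fixed order leaves per-dimension busy times of 96 and 24 time units, while the greedy order evens them out to 60 and 60. If the dimensions can work in parallel, the busiest dimension bounds the completion time, which is why balancing the load raises overall bandwidth utilization in this toy model.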
