The continuous growth in both model size and training data of modern Deep Neural Networks (DNNs) has led to training tasks taking days or even months. Distributed training is a solution that reduces training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activations, depending on the parallelization strategy. In today's datacenters, for training at scale, NPUs are connected through multi-dimensional interconnection links with different bandwidths and latencies. Hence, as this work identifies, keeping all network dimensions busy and maximizing network BW utilization is a challenging task in such a hybrid network environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.88x (2.92x max), and improves the end-to-end training iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and 1.35x (1.78x max), respectively.
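To make the load-balancing idea in the abstract concrete, the following is a minimal illustrative sketch only, not Themis's actual scheduling algorithm: a greedy policy that assigns collective chunks to the network dimension that would finish them earliest, given heterogeneous per-dimension bandwidths. All function names and parameters here are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): greedy chunk-to-dimension
# assignment that balances accumulated transfer time across network dimensions.

def schedule_chunks(chunk_sizes, dim_bandwidths):
    """Assign each chunk to the dimension that currently finishes it earliest.

    chunk_sizes:    list of chunk sizes in bytes
    dim_bandwidths: list of per-dimension bandwidths in bytes/sec
    Returns a list mapping each chunk index to a dimension index.
    """
    loads = [0.0] * len(dim_bandwidths)  # accumulated transfer time per dimension
    assignment = []
    for size in chunk_sizes:
        # Pick the dimension whose total load stays smallest after taking this chunk.
        best = min(range(len(dim_bandwidths)),
                   key=lambda d: loads[d] + size / dim_bandwidths[d])
        loads[best] += size / dim_bandwidths[best]
        assignment.append(best)
    return assignment

if __name__ == "__main__":
    # Hypothetical two-dimensional hierarchical network:
    # a fast intra-node dimension and a slower scale-out dimension.
    chunks = [64e6] * 8          # eight 64 MB chunks of an All-Reduce
    bandwidths = [300e9, 50e9]   # e.g., ~300 GB/s and ~50 GB/s links
    print(schedule_chunks(chunks, bandwidths))
```

Under these assumptions, more chunks land on the faster dimension so that both dimensions stay busy, which is the intuition behind balancing communication loads across a hybrid network.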