The continuous growth in both model size and training data of modern Deep Neural Networks (DNNs) has led to training tasks taking days or even months. Distributed training is a solution that reduces training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activations, depending on the parallelization strategy. In today's datacenters, for training at scale, NPUs are connected through multi-dimensional interconnection links with different bandwidths and latencies. Hence, as this work identifies, keeping all network dimensions busy and maximizing network BW utilization is a challenging task in such a hybrid network environment. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.88x (2.92x max), and improves the end-to-end training iteration performance of real workloads such as ResNet-50, GNMT, DLRM, and Transformer-1T by 1.49x (1.96x max), 1.41x (1.81x max), 1.42x (1.80x max), and 1.35x (1.78x max), respectively.
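To make the load-balancing idea in the abstract concrete, the following is a minimal illustrative sketch only, not Themis's actual scheduling algorithm: a greedy policy that assigns collective chunks to the network dimension that would finish them earliest, given heterogeneous per-dimension bandwidths. All function names and parameters here are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): greedy chunk-to-dimension
# assignment that balances accumulated transfer time across network dimensions.

def schedule_chunks(chunk_sizes, dim_bandwidths):
    """Assign each chunk to the dimension that currently finishes it earliest.

    chunk_sizes:    list of chunk sizes in bytes
    dim_bandwidths: list of per-dimension bandwidths in bytes/sec
    Returns a list mapping each chunk index to a dimension index.
    """
    loads = [0.0] * len(dim_bandwidths)  # accumulated transfer time per dimension
    assignment = []
    for size in chunk_sizes:
        # Pick the dimension whose total load stays smallest after taking this chunk.
        best = min(range(len(dim_bandwidths)),
                   key=lambda d: loads[d] + size / dim_bandwidths[d])
        loads[best] += size / dim_bandwidths[best]
        assignment.append(best)
    return assignment

if __name__ == "__main__":
    # Hypothetical two-dimensional hierarchical network:
    # a fast intra-node dimension and a slower scale-out dimension.
    chunks = [64e6] * 8          # eight 64 MB chunks of an All-Reduce
    bandwidths = [300e9, 50e9]   # e.g., ~300 GB/s and ~50 GB/s links
    print(schedule_chunks(chunks, bandwidths))
```

Under these assumptions, more chunks land on the faster dimension so that both dimensions stay busy, which is the intuition behind balancing communication loads across a hybrid network.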