Recent trends in high-performance computing and deep learning have led to a proliferation of studies on large-scale deep neural network training. However, the frequent communication required among computation nodes drastically slows overall training, creating a bottleneck in distributed training, particularly in clusters with limited network bandwidth. To mitigate the drawbacks of distributed communication, researchers have proposed various optimization strategies. In this paper, we provide a comprehensive survey of communication strategies from both an algorithmic viewpoint and a computer-network perspective. Algorithm-level optimizations focus on reducing the communication volume incurred during distributed training, while network-level optimizations focus on accelerating communication between distributed devices. At the algorithm level, we describe how to reduce the number of communication rounds and the number of bits transmitted per round. In addition, we explain how computation can be overlapped with communication. At the network level, we discuss the effects of network infrastructure, including logical communication schemes and network protocols. Finally, we extrapolate potential future challenges and new research directions for accelerating communication in distributed deep neural network training.