Top-$k$ sparsification has recently been widely used to reduce the communication volume in distributed deep learning; however, due to the Gradient Accumulation (GA) dilemma, its performance remains limited. Several methods have been proposed to handle the GA dilemma, but they suffer from two drawbacks: (1) they incur high communication complexity because they introduce a large amount of extra transmission; (2) they are not flexible for non-power-of-two numbers of workers. To solve these two problems, we propose a flexible and efficient sparse communication framework, dubbed SparDL. SparDL uses the Spar-Reduce-Scatter algorithm to solve the GA dilemma without additional communication operations and is flexible with respect to any number of workers. Moreover, to further reduce the communication complexity and to adjust the proportion of latency and bandwidth cost within it, we propose the Spar-All-Gather algorithm as part of SparDL. Extensive experiments validate the superiority of SparDL.
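Since the abstract centers on top-$k$ sparsification with local gradient accumulation, the following minimal sketch illustrates that primitive under common assumptions; it is not the SparDL implementation, and the function name `topk_sparsify` and its arguments are purely illustrative.

```python
import torch

def topk_sparsify(grad: torch.Tensor, k: int, residual: torch.Tensor):
    """Keep only the k largest-magnitude gradient entries; accumulate the rest locally.

    The locally accumulated residual is the source of the GA dilemma discussed above:
    dropped entries must be added back into later gradients before selection.
    """
    dense = grad + residual                      # fold in leftover gradients from earlier steps
    flat = dense.flatten()
    _, idx = torch.topk(flat.abs(), k)           # indices of the k largest magnitudes
    values = flat[idx]                           # sparse values to be communicated
    new_residual = flat.clone()
    new_residual[idx] = 0.0                      # transmitted entries are cleared from the residual
    return values, idx, new_residual.view_as(grad)
```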