Communication scheduling, which enables all-reduce communications to be overlapped with backpropagation computations, has been shown to be effective in accelerating distributed training and is commonly adopted in popular distributed deep learning frameworks. However, it has two fundamental problems: (1) each all-reduce operation incurs excessive startup latency proportional to the number of workers; and (2) it achieves only sub-optimal training performance due to the dependency and synchronization requirements of the feed-forward computations in the next iteration. We propose DeAR, a novel scheduling algorithm that decouples the all-reduce primitive into two continuous operations, which overlap with both backpropagation and feed-forward computations without requiring extra communications. We further design a practical tensor fusion algorithm to improve the training performance. Experimental results with five popular models show that DeAR achieves up to 83% and 15% training speedup over state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.
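The abstract does not name the two operations into which the all-reduce is decoupled. The sketch below, written against PyTorch's `torch.distributed` API, assumes the standard decomposition of all-reduce into a reduce-scatter followed by an all-gather: the reduce-scatter is launched asynchronously during backpropagation, and the matching all-gather is only waited on just before the corresponding layer's feed-forward pass in the next iteration. The class name `DecoupledAllReduce` and the hook placement are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.distributed as dist


class DecoupledAllReduce:
    """Replace the single all-reduce on one parameter's gradient with a
    reduce-scatter (overlapped with backpropagation) plus an all-gather
    (waited on only just before the layer's feed-forward pass in the next
    iteration). Illustrative sketch, not the paper's code."""

    def __init__(self, param: torch.nn.Parameter):
        self.param = param
        self.world_size = dist.get_world_size()
        # For simplicity, assume the gradient size is divisible by world_size.
        self.shard = torch.empty(param.numel() // self.world_size,
                                 dtype=param.dtype, device=param.device)
        self.rs_work = None   # pending reduce-scatter handle
        self.ag_work = None   # pending all-gather handle

    def on_backward(self):
        """Gradient hook: launch the reduce-scatter asynchronously so it
        overlaps with the remaining backpropagation computation."""
        flat_grad = self.param.grad.view(-1)
        self.rs_work = dist.reduce_scatter_tensor(
            self.shard, flat_grad, op=dist.ReduceOp.SUM, async_op=True)

    def after_backward(self):
        """Launch the all-gather asynchronously; it can proceed while the next
        iteration's feed-forward computation starts. Where exactly the local
        parameter/gradient update happens is not specified in the abstract and
        is omitted here."""
        self.rs_work.wait()
        self.shard.div_(self.world_size)          # average the summed gradients
        flat_grad = self.param.grad.view(-1)
        self.ag_work = dist.all_gather_into_tensor(
            flat_grad, self.shard, async_op=True)

    def on_forward_pre(self):
        """Forward pre-hook of the owning layer in the next iteration: only now
        do we block on the all-gather, hiding its cost behind the feed-forward
        computation of earlier layers."""
        if self.ag_work is not None:
            self.ag_work.wait()
            self.ag_work = None
```

In a full training loop, one would register `on_backward` through a gradient hook (e.g., `Tensor.register_post_accumulate_grad_hook` in recent PyTorch) and `on_forward_pre` through `Module.register_forward_pre_hook`; the tensor fusion strategy mentioned in the abstract, which batches small tensors before communication, is orthogonal to this sketch.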