Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks (DNNs) on computer clusters. As computational power increases, network communication generally becomes the bottleneck that limits system scalability. Wait-free backpropagation (WFBP) is a popular solution that overlaps communication with computation during training. In this paper, we observe that many DNNs have a large number of layers, each of which communicates only a small amount of data in distributed training, which can make WFBP inefficient. Based on the fact that merging several short communication tasks into a single one can reduce the overall communication time, we formulate an optimization problem to minimize the training time when pipelining communication and computation. We derive an optimal solution that can be computed efficiently without affecting the training performance. We then apply this solution to propose a distributed training algorithm named merged-gradient WFBP (MG-WFBP) and implement it on two frameworks, Caffe and PyTorch. Extensive experiments on three GPU clusters are conducted to verify the effectiveness of MG-WFBP. We further use trace-based simulations with 4 to 2048 GPUs to explore the potential scaling efficiency of MG-WFBP. Experimental results show that MG-WFBP achieves much better scaling performance than existing methods.
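To illustrate the core idea of merging short communication tasks, the following is a minimal sketch (not the authors' implementation) in PyTorch, assuming `torch.distributed` has already been initialized via `init_process_group`. It contrasts a WFBP-style per-layer all-reduce, where every small gradient tensor pays its own per-message startup cost, with a merged all-reduce over a single flattened buffer; MG-WFBP goes further by deciding which gradients to merge via the optimization problem above, whereas this sketch simply merges everything.

```python
import torch
import torch.distributed as dist


def allreduce_per_layer(grads):
    # Baseline WFBP-style communication: one all-reduce per layer gradient,
    # so each small message pays the startup (latency) overhead separately.
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)


def allreduce_merged(grads):
    # Merged-gradient communication: flatten all layer gradients into a
    # single buffer, all-reduce once, then copy the results back per layer.
    flat = torch.cat([g.flatten() for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n
```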