State-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation, wait for the gradients aggregated from all workers, and receive the weight updates before processing the next batch. This synchronous execution model exposes the overhead of gradient/weight communication among the large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD with forward/back propagation to hide 100% of the communication overhead. By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on low-latency, high-throughput interconnects. Theoretical analysis and experimental results show a convergence rate of O(1/sqrt(K)), the same as SGD. The performance evaluation demonstrates that it enables a linear performance scale-up with the cluster size.
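To make the delayed-averaging idea concrete, the following is a minimal single-process simulation of local SGD where the model average launched at one step is only applied several local steps later, so communication can overlap with computation. It is an illustrative sketch under our own assumptions, not the paper's implementation: the toy quadratic loss, the parameter names (K, H, d), and the reconciliation rule (delayed average plus the local progress made during the delay) are hypothetical.

```python
# Minimal sketch of local SGD with delayed averaging (assumed scheme, not the
# paper's reference implementation). K workers are simulated in one process.
import numpy as np

rng = np.random.default_rng(0)

K = 4          # number of simulated workers
H = 8          # local steps between launching a model average
d = 3          # delay (in local steps) before the launched average is applied
lr = 0.05      # learning rate
steps = 200    # total local steps per worker
dim = 10       # model dimension

# Toy problem: worker k minimizes 0.5 * ||x - c_k||^2, so the global optimum
# of the averaged objective is the mean of the targets c_k.
targets = rng.normal(size=(K, dim))
x = np.zeros((K, dim))            # per-worker model replicas

pending = None                    # (apply_step, averaged_model, snapshot)

for t in range(steps):
    # Local SGD step on each worker (its own gradient plus noise).
    grads = (x - targets) + 0.1 * rng.normal(size=x.shape)
    x -= lr * grads

    # Launch an average every H steps; it is only *applied* d steps later,
    # modeling an all-reduce that overlaps with the next d local steps.
    if t % H == 0:
        pending = (t + d, x.mean(axis=0), x.copy())

    # When the delayed average arrives, each worker keeps the local progress
    # it made during the delay and re-applies it on top of the average.
    if pending is not None and t == pending[0]:
        _, avg, snapshot = pending
        x = avg + (x - snapshot)
        pending = None

print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```

In a real distributed setting the averaging step would be an asynchronous all-reduce issued by each worker, with the delay d chosen so that the collective completes before its result is consumed.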