The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication-efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has been demonstrated to be critical for achieving runtime speedup. This motivates us to investigate how DmSGD performs in the large-batch scenario. In this work, we find that the momentum term can amplify the inconsistency bias in DmSGD. This bias becomes more evident as the batch size grows and hence results in severe performance degradation. We next propose DecentLaM, a novel decentralized large-batch momentum SGD algorithm that removes the momentum-incurred bias. We establish its convergence rate for both the non-convex and strongly-convex scenarios. Our theoretical results justify the superiority of DecentLaM over DmSGD, especially in the large-batch scenario. Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM provides both efficient and high-quality training.
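To illustrate the neighbor-only averaging that the abstract contrasts with a global average, the following is a minimal sketch of one synchronous decentralized momentum SGD step. The update form shown (local heavy-ball momentum followed by mixing with neighbors through a mixing matrix `W`) is an illustrative assumption, not the paper's exact DmSGD or DecentLaM recursion; the function and variable names are hypothetical.

```python
# Minimal sketch (assumed update form, not the paper's exact algorithm):
# each node applies a local momentum SGD update, then averages parameters
# only with its neighbors via a row-stochastic mixing matrix W, instead of
# performing a global average across all computing nodes.
import numpy as np

def decentralized_momentum_step(x, m, grads, W, lr=0.1, beta=0.9):
    """One synchronous step over all nodes.

    x, m  : (n_nodes, dim) per-node parameters and momentum buffers
    grads : (n_nodes, dim) stochastic gradients evaluated at each node
    W     : (n_nodes, n_nodes) mixing matrix; W[i, j] > 0 only if node j
            is a neighbor of node i (or j == i), and each row sums to 1
    """
    m_new = beta * m + grads          # local heavy-ball momentum update
    x_local = x - lr * m_new          # local SGD-with-momentum step
    x_new = W @ x_local               # average only with neighbors
    return x_new, m_new

# Toy usage: 4 nodes on a ring topology, 3-dimensional parameters.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
n_nodes, dim = 4, 3
x = np.random.randn(n_nodes, dim)
m = np.zeros((n_nodes, dim))
grads = np.random.randn(n_nodes, dim)   # stand-in for per-node minibatch gradients
x, m = decentralized_momentum_step(x, m, grads, W)
```

Because `W` is sparse in practice (each node communicates with only a few neighbors), each step costs far less communication than the all-reduce needed for a global average, which is the efficiency argument the abstract makes.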