The communication bottleneck is a critical problem in large-scale distributed deep learning. In this work, we study distributed SGD with random block-wise sparsification as the gradient compressor, which is ring-allreduce compatible and highly computation-efficient but leads to inferior performance. To tackle this issue, we improve communication-efficient distributed SGD from a novel perspective, namely the trade-off between the variance and the second moment of the gradient. Motivated by this trade-off, we propose a new detached error feedback (DEF) algorithm, which enjoys a better convergence bound than error feedback for non-convex problems. We also propose DEF-A to accelerate the generalization of DEF in the early stages of training, with better generalization bounds than DEF. Furthermore, we establish, for the first time, the connection between communication-efficient distributed SGD and SGD with iterate averaging (SGD-IA). Extensive deep learning experiments show significant empirical improvements of the proposed methods under various settings.
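To make the compressor concrete, the following is a minimal sketch of random block-wise sparsification of a gradient tensor. It is not the paper's exact implementation: the equal block sizes, the shared random seed across workers (which is what makes the selected blocks line up for ring-allreduce), and the unbiased 1/keep_ratio rescaling are all assumptions chosen for illustration.

```python
import torch


def random_block_sparsify(grad: torch.Tensor, num_blocks: int, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of equally sized blocks of the flattened gradient.

    Assumptions (not taken from the paper): blocks are equal-sized, all workers
    draw the same block indices via a shared RNG so the sparsified gradients can
    be summed with ring-allreduce, and kept blocks are rescaled by 1/keep_ratio
    so the compressed gradient is an unbiased estimate of the original.
    """
    flat = grad.flatten()
    # Pad so the gradient splits evenly into num_blocks blocks.
    block_size = (flat.numel() + num_blocks - 1) // num_blocks
    padded = torch.nn.functional.pad(flat, (0, block_size * num_blocks - flat.numel()))
    blocks = padded.view(num_blocks, block_size)

    # Randomly select which blocks to keep this step.
    num_keep = max(1, int(round(keep_ratio * num_blocks)))
    kept_idx = torch.randperm(num_blocks)[:num_keep]

    # Zero out dropped blocks; rescale kept blocks for unbiasedness.
    compressed = torch.zeros_like(blocks)
    compressed[kept_idx] = blocks[kept_idx] / keep_ratio
    return compressed.view(-1)[: flat.numel()].view_as(grad)
```

The dropped mass (the difference between the true and compressed gradient) is what error-feedback-style methods such as DEF accumulate and re-inject; the sketch above only illustrates the compression step itself.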