Although distributed machine learning methods can speed up the training of large deep neural networks, communication cost has become a non-negligible bottleneck that constrains performance. To address this challenge, communication-efficient distributed learning methods based on gradient compression have been designed to reduce the communication cost, and more recently local error feedback has been incorporated to compensate for the corresponding performance loss. However, in this paper we show that local error feedback raises a new "gradient mismatch" problem in centralized distributed training, which can lead to degraded performance compared with full-precision training. To solve this critical problem, we propose two novel techniques, 1) step ahead and 2) error averaging, with rigorous theoretical analysis. Both our theoretical and empirical results show that the new methods resolve the "gradient mismatch" problem. Our experiments further show that, with common gradient compression schemes, our methods train faster than both full-precision training and local error feedback in terms of training epochs, and without performance loss.
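For context, the sketch below illustrates the standard local error feedback mechanism that the abstract builds on: each worker compresses the error-compensated gradient and keeps the residual locally for the next round. This is a minimal illustration in Python/NumPy, not the paper's proposed step-ahead or error-averaging methods; the function names (`top_k_compress`, `worker_step`) and the parameter `k` are illustrative, not taken from the paper.

```python
import numpy as np

def top_k_compress(grad, k):
    """Keep only the k largest-magnitude entries (a common gradient compressor)."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def worker_step(grad, error, k):
    """One local error-feedback step on a single worker.

    The worker adds the residual carried over from the previous round,
    compresses the result, communicates only the compressed gradient,
    and stores the new residual locally.
    """
    compensated = grad + error              # error-compensated gradient
    sent = top_k_compress(compensated, k)   # what is actually communicated
    new_error = compensated - sent          # residual kept locally on this worker
    return sent, new_error

# Toy usage: one worker over two iterations with a persistent local error buffer.
error = np.zeros(8)
for grad in (np.random.randn(8), np.random.randn(8)):
    sent, error = worker_step(grad, error, k=2)
```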