Distributed stochastic gradient descent (SGD) with gradient compression has emerged as a communication-efficient solution to accelerate distributed learning. Top-K sparsification is one of the most popular gradient compression methods, and it sparsifies the gradient to a fixed degree throughout model training. However, there is no existing approach that adaptively adjusts the degree of sparsification to maximize model performance or training speed. This paper addresses this issue by proposing a novel adaptive Top-K SGD framework that adapts the degree of sparsification at each gradient descent step to maximize convergence performance by exploring the trade-off between communication cost and convergence error. First, we derive an upper bound on the convergence error for the adaptive sparsification scheme and the loss function. Second, we design the algorithm by minimizing the convergence error under a communication cost constraint. Finally, numerical results show that the proposed adaptive Top-K SGD achieves a significantly better convergence rate than state-of-the-art methods.
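For reference, the sketch below illustrates the plain fixed-degree Top-K compressor that the abstract refers to; it is a minimal NumPy illustration, not the paper's implementation. The function name `top_k_sparsify`, the 1% sparsification ratio in the usage example, and the NumPy-based setup are illustrative assumptions, and the paper's adaptive choice of K (minimizing the convergence-error bound under a communication budget) is not reproduced here.

```python
import numpy as np

def top_k_sparsify(grad: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of the gradient and zero the rest.

    This is the standard fixed-degree Top-K compressor; the paper's
    contribution is choosing k adaptively at each step, which is not shown here.
    """
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

# Usage sketch with a fixed sparsification degree of 1% (illustrative value).
grad = np.random.randn(10_000)
compressed = top_k_sparsify(grad, k=int(0.01 * grad.size))
print(f"non-zeros kept: {np.count_nonzero(compressed)} / {grad.size}")
```

In a distributed setting, each worker would send only the retained values and their indices instead of the dense gradient, which is where the communication savings come from.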