Due to the explosion in the size of training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error-feedback compression has proven successful in reducing communication costs with stochastic gradient descent (SGD), far fewer attempts have been made to build communication-efficient adaptive gradient methods with provable guarantees, even though such methods are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization that is provably efficient. Our distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to a first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks corroborate our theory.
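To make the two ingredients named above concrete, the following is a minimal single-process sketch of error-feedback gradient compression on the workers combined with an AMSGrad-style update on the aggregated messages. It is not the paper's exact algorithm: the compression rule (top-k here), the hyperparameters, and all helper names (`topk_compress`, `Worker`, `Server`) are illustrative assumptions.

```python
# Sketch (under assumptions): error-feedback compressed gradients + AMSGrad-style update.
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

class Worker:
    def __init__(self, dim, k):
        self.e = np.zeros(dim)   # local error-feedback accumulator
        self.k = k

    def compress_gradient(self, grad):
        corrected = grad + self.e      # add back previously dropped mass
        msg = topk_compress(corrected, self.k)
        self.e = corrected - msg       # remember what was not transmitted
        return msg                     # sparse message sent to the server

class Server:
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.m = np.zeros(dim)         # first moment
        self.v = np.zeros(dim)         # second moment
        self.v_hat = np.zeros(dim)     # running max of second moments (AMSGrad)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps

    def step(self, msgs):
        g = np.mean(msgs, axis=0)      # aggregate compressed gradients
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g ** 2
        self.v_hat = np.maximum(self.v_hat, self.v)
        return -self.lr * self.m / (np.sqrt(self.v_hat) + self.eps)

# Toy run: each worker sees a noisy gradient of f(x) = 0.5 * ||x||^2.
dim, k, n_workers = 100, 10, 4
x = np.random.randn(dim)
workers = [Worker(dim, k) for _ in range(n_workers)]
server = Server(dim)
for _ in range(500):
    msgs = [w.compress_gradient(x + 0.1 * np.random.randn(dim)) for w in workers]
    update = server.step(msgs)   # broadcast; each worker applies the update locally
    x += update
print("final squared norm:", float(x @ x))
```

In this sketch the server only sees compressed messages, while the error accumulator on each worker re-injects the dropped coordinates in later rounds; the "worker-side model update" mentioned in the abstract corresponds to applying the broadcast update locally rather than receiving the full model.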