Adaptive gradient methods, including Adam, AdaGrad, and their variants, have proven highly successful in training deep learning models such as neural networks. Meanwhile, with the growing demand for distributed computing, distributed optimization algorithms are rapidly becoming a focal point of research. As computing power grows and machine learning models are increasingly deployed on mobile devices, the communication cost of distributed training algorithms requires careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that converts existing adaptive gradient methods into their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed framework and show that if a given adaptive gradient method converges under certain conditions, then its decentralized counterpart also converges. We illustrate the benefit of our generic decentralized framework on a prototypical method, AMSGrad, both theoretically and numerically.
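The abstract does not spell out the framework's update rule, so the following is only a minimal sketch of how an adaptive method such as AMSGrad might be combined with decentralized (gossip-style) averaging over a communication graph. The function name `decentralized_amsgrad_step`, the mixing matrix `W`, and all hyperparameter values are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch (assumption, not the paper's exact algorithm):
# each node mixes parameters with its neighbors, then applies a
# local AMSGrad-style adaptive update on its own stochastic gradient.
import numpy as np

def decentralized_amsgrad_step(params, states, grads, W, lr=1e-3,
                               beta1=0.9, beta2=0.999, eps=1e-8):
    """One synchronous round over all n nodes.

    params: (n, d) array; row i holds node i's parameters.
    states: list of n dicts with moment buffers 'm', 'v', 'v_hat' (shape (d,)).
    grads:  (n, d) array of local stochastic gradients.
    W:      (n, n) doubly stochastic mixing matrix, nonzero only between
            nodes that communicate (illustrative assumption).
    """
    n, _ = params.shape
    # 1) Gossip step: each node averages parameters with its neighbors.
    mixed = W @ params
    # 2) Local AMSGrad-style adaptive update on each node.
    new_params = np.empty_like(params)
    for i in range(n):
        s = states[i]
        s["m"] = beta1 * s["m"] + (1 - beta1) * grads[i]
        s["v"] = beta2 * s["v"] + (1 - beta2) * grads[i] ** 2
        s["v_hat"] = np.maximum(s["v_hat"], s["v"])  # AMSGrad max correction
        new_params[i] = mixed[i] - lr * s["m"] / (np.sqrt(s["v_hat"]) + eps)
    return new_params
```

In this sketch, convergence of the decentralized variant would hinge on both the underlying adaptive update and the mixing matrix `W` (e.g., Metropolis weights on a connected graph), which is consistent with the abstract's statement that the decentralized counterpart inherits convergence under specific conditions.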