Accelerated gradient-based methods are extensively used for solving non-convex machine learning problems, especially when the data points are abundant or the available data is distributed across several agents. Two prominent accelerated gradient algorithms are AdaGrad and Adam. AdaGrad is the simplest accelerated gradient method and is particularly effective for sparse data. Adam has been shown to perform favorably in deep learning problems compared to other methods. In this paper, we propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for accelerating the solution of potentially non-convex machine learning problems. Specifically, we adopt a state-space perspective for analyzing the convergence of gradient acceleration algorithms, namely G-AdaGrad and Adam, in machine learning. Our proposed state-space models are governed by ordinary differential equations. We present simple convergence proofs of these two algorithms in the deterministic setting with minimal assumptions. Our analysis also provides intuition for improving upon AdaGrad's convergence rate. We provide empirical results on the MNIST dataset to reinforce our claims on the convergence and performance of G-AdaGrad and Adam.
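For context, the sketch below shows the standard (textbook) AdaGrad and Adam update rules that the abstract refers to; it is a minimal NumPy sketch, not the paper's G-AdaGrad algorithm or its state-space/ODE model, which are defined in the paper itself. The function names `adagrad_step` and `adam_step` and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """Standard AdaGrad update: the per-coordinate step size shrinks as
    squared gradients accumulate in `accum`."""
    accum = accum + grad ** 2
    x = x - lr * grad / (np.sqrt(accum) + eps)
    return x, accum

def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update: exponential moving averages of the gradient (m)
    and the squared gradient (v), with bias correction at iteration t >= 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

In both rules the effective step size adapts per coordinate through the accumulated (or averaged) squared gradients, which is the structure the paper's continuous-time, state-space analysis examines.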