Adaptive gradient methods have become popular for optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster than the classical stochastic gradient method, variants of Adam such as the AdaBelief algorithm have been proposed to improve Adam's comparatively poor generalization. This paper develops a generic framework for adaptive gradient methods that solve non-convex optimization problems. We first model adaptive gradient methods in a state-space framework, which allows us to present simpler convergence proofs for adaptive optimizers such as AdaGrad, Adam, and AdaBelief. We then use the transfer function paradigm from classical control theory to propose a new variant of Adam, which we call AdamSSM: we add an appropriate pole-zero pair to the transfer function from squared gradients to the second moment estimate. We prove the convergence of the proposed AdamSSM algorithm. Experiments on benchmark machine learning tasks, namely image classification with CNN architectures and language modeling with an LSTM architecture, demonstrate that AdamSSM narrows the gap between generalization accuracy and fast convergence relative to recent adaptive gradient methods.
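
To illustrate the state-space view, the following minimal Python sketch (not taken from the paper) writes the standard Adam second-moment recursion v_t = β2·v_{t-1} + (1 − β2)·g_t² as a one-state linear system with the squared gradient as input. Under this assumed realization, the transfer function from squared gradients to the second moment estimate is (1 − β2)·z / (z − β2), with a single pole at β2 and unit DC gain. The function names, the choice of state x_t = v_{t-1}, and the value β2 = 0.9 in the usage lines are assumptions made for illustration. The specific pole-zero pair that AdamSSM adds is not given in the abstract and is not reproduced here.

```python
import numpy as np


def second_moment_recursion(sq_grads, beta2=0.999):
    """Standard Adam second-moment update (no bias correction):
    v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2, with v_0 = 0."""
    v = 0.0
    out = []
    for u in sq_grads:
        v = beta2 * v + (1.0 - beta2) * u
        out.append(v)
    return np.array(out)


def second_moment_state_space(sq_grads, beta2=0.999):
    """Same filter as a state-space model (illustrative realization):
        x_{t+1} = A x_t + B u_t
        y_t     = C x_t + D u_t
    with state x_t = v_{t-1}, input u_t = g_t^2, output y_t = v_t."""
    A, B = beta2, 1.0 - beta2
    C, D = beta2, 1.0 - beta2
    x = 0.0
    out = []
    for u in sq_grads:
        out.append(C * x + D * u)  # output equation
        x = A * x + B * u          # state update
    return np.array(out)


def dc_gain(beta2):
    """Transfer function C (z - A)^{-1} B + D evaluated at z = 1."""
    A, B = beta2, 1.0 - beta2
    C, D = beta2, 1.0 - beta2
    return C * B / (1.0 - A) + D


# Usage: both forms agree on arbitrary squared gradients.
rng = np.random.default_rng(0)
g2 = rng.standard_normal(50) ** 2
v_rec = second_moment_recursion(g2, beta2=0.9)
v_ss = second_moment_state_space(g2, beta2=0.9)
```

Because the DC gain is one, a constant squared gradient is tracked exactly in the limit; changing the poles and zeros of this filter changes how quickly and how smoothly the second moment estimate responds to changes in the squared gradients.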