Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of gradient behavior. Hence, an efficient way to optimize deep networks is to use adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients and therefore do not take advantage of the local change in gradients. In this paper, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter so that parameters with rapidly changing gradients take larger steps and parameters with slowly changing gradients take smaller steps. The convergence analysis is carried out using the regret-bound approach of the online learning framework. A rigorous analysis is conducted over three synthetic, complex non-convex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 datasets to observe the performance of diffGrad with respect to state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual-unit (ResNet) based Convolutional Neural Network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well when training CNNs with different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad.
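To make the idea concrete, the sketch below shows one possible form of a diffGrad-style per-parameter update. It assumes an Adam-like base (exponential moving averages of the gradient and squared gradient) whose step is scaled element-wise by a "friction" term derived from the difference between the current and immediately preceding gradient, so that parameters with rapidly changing gradients take larger effective steps. The sigmoid form of the friction term and the function name `diffgrad_step` are illustrative assumptions based on the description above; the authors' implementation is available at the repository linked in the abstract.

```python
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical diffGrad-style parameter update (sketch, not the official code)."""
    # Adam-style exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias-corrected moment estimates (t is the 1-based step counter).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Friction coefficient in (0, 1): assumed here to be a sigmoid of the
    # absolute change between the previous and current gradient, so a larger
    # local gradient change yields a larger effective step size.
    xi = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))

    # Adam-like step scaled element-wise by the friction term.
    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In use, `m`, `v`, and `prev_grad` would be carried across iterations for each parameter tensor; with `xi` fixed to 1 the update reduces to plain Adam, which is the baseline the abstract compares against.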