As a simple and efficient optimization method in deep learning, stochastic gradient descent (SGD) has attracted tremendous attention. In the vanishing-learning-rate regime, SGD is now relatively well understood, and the majority of theoretical approaches to SGD set their assumptions in the continuous-time limit. However, continuous-time predictions are unlikely to reflect experimental observations well, because practice often operates in the large-learning-rate regime, where training is faster and models often generalize better. In this paper, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and relating them to experimental observations. The main contributions of this work are to derive the stationary distribution of discrete-time SGD on a quadratic loss function, with and without momentum. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of mini-batch noise, the escape rate from a sharp minimum, and the stationary distribution of a few second-order methods.
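The exactly solvable quadratic setting mentioned in the abstract can be illustrated with a minimal simulation. The sketch below is not the paper's derivation: it assumes additive Gaussian gradient noise as a stand-in for mini-batch noise, and the parameter names and values (lam, eta, sigma2) are illustrative choices. Under that assumption, discrete-time SGD on a one-dimensional quadratic loss has stationary variance η σ² / (λ(2 − ηλ)), whose continuous-time (vanishing learning rate) limit is η σ² / (2λ); the gap between the two grows with the learning rate, which is the regime the abstract emphasizes.

```python
import numpy as np

# Minimal sketch (not the paper's result): discrete-time SGD on the 1D quadratic
# loss L(theta) = lam * theta**2 / 2, with additive Gaussian gradient noise of
# variance sigma2 standing in for mini-batch noise. Parameter values are
# illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)

lam, eta, sigma2 = 1.0, 0.8, 0.5   # curvature, (large) learning rate, noise variance
theta, samples = 0.0, []

for t in range(200_000):
    grad = lam * theta + rng.normal(scale=np.sqrt(sigma2))  # noisy gradient estimate
    theta -= eta * grad                                     # plain SGD update
    if t > 10_000:                                          # discard burn-in
        samples.append(theta)

empirical_var = np.var(samples)
discrete_var = eta * sigma2 / (lam * (2.0 - eta * lam))  # exact discrete-time stationary variance
continuum_var = eta * sigma2 / (2.0 * lam)               # continuous-time (vanishing-eta) prediction

print(f"empirical       : {empirical_var:.4f}")
print(f"discrete-time   : {discrete_var:.4f}")
print(f"continuous-time : {continuum_var:.4f}")
```

With the learning rate set to 0.8, the discrete-time prediction (about 0.33) clearly separates from the continuous-time one (0.20), matching the empirical variance; shrinking eta makes the two coincide.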