In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contribution of this work is to derive the stationary distribution of discrete-time SGD for a quadratic loss function, with and without momentum; in particular, one implication of our result is that the fluctuation caused by the discrete-time dynamics takes a distorted shape and is dramatically larger than what a continuous-time theory could predict. Applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of minibatch noise, optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary distributions of a few second-order methods, including damped Newton's method and natural gradient descent.
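As a concrete illustration of this gap, the following minimal sketch simulates discrete-time SGD on a one-dimensional quadratic loss L(θ) = aθ²/2 with additive Gaussian gradient noise of scale σ (a simplifying assumption for illustration; the noise model, parameter values, and variable names here are not taken from the paper). For this model the iterate is an AR(1) process, so the exact discrete-time stationary variance is λσ²/(a(2 − λa)), which exceeds the continuous-time (Ornstein–Uhlenbeck) prediction λσ²/(2a) by a factor 2/(2 − λa) that diverges as λa → 2.

```python
# Illustrative sketch (not the paper's code): stationary fluctuation of
# discrete-time SGD on a 1D quadratic loss L(theta) = a * theta**2 / 2,
# with assumed additive Gaussian gradient noise of scale sigma.
import numpy as np

rng = np.random.default_rng(0)
a, sigma, lr = 1.0, 1.0, 1.5      # curvature, noise scale, learning rate
steps, burn_in = 200_000, 10_000  # long run so the chain reaches stationarity

theta, samples = 0.0, []
for t in range(steps):
    grad = a * theta + sigma * rng.standard_normal()  # noisy gradient
    theta -= lr * grad                                # plain SGD update
    if t >= burn_in:
        samples.append(theta)

empirical = np.var(samples)
# Continuous-time (SDE / Ornstein-Uhlenbeck) prediction: lr * sigma^2 / (2a)
continuous = lr * sigma**2 / (2 * a)
# Exact discrete-time stationary variance of the AR(1) iterate:
# lr^2 * sigma^2 / (1 - (1 - lr*a)^2) = lr * sigma^2 / (a * (2 - lr*a))
discrete = lr * sigma**2 / (a * (2 - lr * a))

print(f"empirical:       {empirical:.3f}")
print(f"continuous-time: {continuous:.3f}")  # underestimates at large lr
print(f"discrete-time:   {discrete:.3f}")    # matches the simulation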