In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contribution of this work is the derivation of the stationary distribution of discrete-time SGD on a quadratic loss, both with and without momentum; in particular, one implication of our result is that the fluctuation caused by the discrete-time dynamics takes a distorted shape and is dramatically larger than what a continuous-time theory predicts. Applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of minibatch noise, optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary covariance of several second-order methods, including damped Newton's method, natural gradient descent, and Adam.
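As a minimal worked illustration of this discrete-versus-continuous-time gap (a sketch for the one-dimensional quadratic case; the loss $L(\theta) = a\theta^2/2$, learning rate $\lambda$, and noise variance $\sigma^2$ are notation introduced here for exposition), consider discrete-time SGD with additive gradient noise,
\[
\theta_{t+1} = \theta_t - \lambda\,(a\theta_t + \epsilon_t), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).
\]
Squaring and taking the stationary expectation of this recursion gives the discrete-time stationary variance
\[
\mathrm{Var}[\theta] = \frac{\lambda^2 \sigma^2}{1 - (1 - \lambda a)^2} = \frac{\lambda \sigma^2}{a\,(2 - \lambda a)},
\]
whereas the continuous-time (Ornstein--Uhlenbeck) approximation predicts $\lambda \sigma^2 / (2a)$. The two differ by a factor of $1/(1 - \lambda a/2)$, which is close to $1$ only when $\lambda a \ll 1$ and diverges as $\lambda a \to 2$, the stability threshold of the discrete dynamics.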