Following the same approach as [SSJ20], in this paper we present a theoretical analysis of stochastic gradient descent with momentum (SGD with momentum). In contrast to the learning-rate-only setting, we demonstrate that for SGD with momentum it is the two hyperparameters together, the learning rate and the momentum coefficient, that govern the linear rate of convergence in non-convex optimization. Our analysis relies on a hyperparameter-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. As in the learning-rate-only case, we establish linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison, we show how the optimal linear rate of convergence and the final optimality gap, which for plain SGD depend only on the learning rate, vary as the momentum coefficient increases from zero to one. We then propose a mathematical interpretation of why, in practice, SGD with momentum converges faster and is more robust to the choice of learning rate than standard SGD. Finally, we show that in the presence of noise, Nesterov momentum has no essential difference from standard (heavy-ball) momentum.
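To fix notation for the reader, a minimal sketch of what such an hp-dependent SDE may look like is the underdamped (kinetic) Langevin form below; the friction coefficient $\gamma(\mu)$ and diffusion coefficient $\sigma(s,\mu)$ written here are illustrative placeholders rather than the paper's exact coefficients, with $s$ denoting the learning rate and $\mu$ the momentum coefficient:
\[
\begin{aligned}
dX_t &= V_t\, dt,\\
dV_t &= -\bigl(\gamma(\mu)\,V_t + \nabla f(X_t)\bigr)\, dt + \sigma(s,\mu)\, dW_t,
\end{aligned}
\]
where $W_t$ is a standard Brownian motion. Dynamics of this kinetic type are generated by a Kramers-Fokker-Planck operator, whose spectrum governs the rate at which the law of $(X_t, V_t)$ relaxes, which is the quantity analyzed in the convergence results summarized above.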