Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, and variants of it have flourished in the development of other popular update methods, e.g., ADAM [KB15] and AMSGrad [RKK18]. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second-order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter: our analysis suggests that $\beta \in [0,1)$ should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.
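For concreteness, the heavy-ball form of the momentum update described above can be sketched as follows (the symbols $x_t$, $m_t$, $g_t$, and $\eta$ are illustrative and may differ from the notation used in the body of the paper; only the momentum parameter $\beta$ is fixed by the abstract):
\[
  m_t = \beta\, m_{t-1} + g_t, \qquad x_{t+1} = x_t - \eta\, m_t,
\]
where $g_t$ is a stochastic gradient computed at the current iterate $x_t$ on a batch of examples, $\eta$ is the step size, and $\beta \in [0,1)$ weights the previous change in parameters; taking $\beta = 0$ recovers standard SGD.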