The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning, e.g., training deep neural networks and variational Bayesian inference. Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD through diffusion approximations, for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps the iterate escape from saddle points, but hurts convergence within the neighborhood of optima (unless the step size or the momentum parameter is annealed). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
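For reference, a standard formulation of the MSGD update reads as follows; the notation (step size $\eta$, momentum parameter $\mu$, stochastic sample $\xi_t$) is assumed here for illustration and is not taken from the text above:
\[
v_{t+1} = \mu\, v_t - \eta\, \nabla f(x_t; \xi_t), \qquad x_{t+1} = x_t + v_{t+1},
\]
where $\nabla f(x_t;\xi_t)$ denotes a stochastic gradient of the objective evaluated at the current iterate $x_t$ on the random sample $\xi_t$. Setting $\mu = 0$ recovers vanilla SGD, which is the baseline against which the saddle-point and local-convergence behavior of momentum is compared.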