Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning. The noise encountered in these applications differs from that in many theoretical analyses of stochastic gradient algorithms. In this article, we discuss some common properties of the energy landscapes and of the stochastic noise encountered in machine learning problems, and how they affect SGD-based optimization. In particular, we show that if the energy landscape resembles that of overparametrized deep learning problems, the learning rate in SGD with machine learning noise can be chosen small but uniformly bounded away from zero for all times. If the objective function satisfies a Łojasiewicz inequality, SGD converges to the global minimum exponentially fast; even for functions which may have local minima, we establish almost sure convergence to the global minimum at an exponential rate from any finite-energy initialization. The assumptions in this result concern the behavior of the objective function where it is either small or large, and the nature of the gradient noise; the energy landscape is left fairly unconstrained on the domain where the objective function takes values in an intermediate regime.
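For concreteness, a standard way to formalize the two central objects above is sketched below; the symbols $\eta$, $g$, $\xi_k$, $c$, $\rho$ and $f^*$ are introduced here for illustration, and the exact exponents, constants and noise assumptions used in the article may differ. The constant-step-size SGD iteration and a Łojasiewicz-type (Polyak–Łojasiewicz) inequality read
\[
x_{k+1} = x_k - \eta\, g(x_k,\xi_k), \qquad \mathbb{E}\bigl[g(x,\xi)\bigr] = \nabla f(x), \qquad \eta > 0 \text{ fixed},
\]
\[
\|\nabla f(x)\|^2 \;\ge\; c\,\bigl(f(x) - f^*\bigr), \qquad c > 0,
\]
where $f^* = \inf_x f(x)$, which equals zero in the overparametrized interpolation regime. Under an inequality of this type, combined with gradient noise whose variance vanishes at the global minimum, one obtains bounds of the form $\mathbb{E}\bigl[f(x_k) - f^*\bigr] \le (1-\rho)^k \bigl(f(x_0) - f^*\bigr)$ for some $\rho \in (0,1)$, which is the kind of exponential rate referred to in the abstract.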