Non-convex optimization problems are ubiquitous in machine learning, especially in deep learning. While such complex problems can often be successfully optimized in practice using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, standard analyses do not show global convergence of SGD on non-convex functions; instead, they show convergence to stationary points, which may be local minima or saddle points rather than global minima. We identify a broad class of non-convex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise, covering SGD as a special case) converges to a global minimum, or a neighborhood thereof, in contrast to noiseless gradient descent, which can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function, we show that SGD can converge linearly to a global optimum.
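To make the setting concrete, below is a minimal, self-contained sketch (not taken from the paper) comparing noiseless gradient descent with a perturbed variant on a toy 1-D objective that stays within a bounded distance of the strongly convex reference x^2, mirroring the "close to a convex-like function" regime described above. The objective f, the noise scale, the step size, and the starting point are all illustrative assumptions; with these settings the noiseless run settles in a spurious local minimum, while the perturbation lets the iterates reach a neighborhood of the global minimizer.

```python
import numpy as np

# Toy 1-D objective (illustrative, not from the paper): non-convex, but
# |f(x) - x^2| <= 1.5 for all x, so it stays within a bounded distance of
# the strongly convex reference g(x) = x^2.  The sine term creates spurious
# local minima far from the global minimizer near x ~ -0.3.
def f(x):
    return x**2 + 1.5 * np.sin(5.0 * x)

def grad_f(x):
    return 2.0 * x + 7.5 * np.cos(5.0 * x)

def plain_gd(x0, lr=0.01, steps=2000):
    """Noiseless gradient descent; from x0 = 3.0 with this step size it
    settles in a local minimum of f near x ~ 2.1, far from the global one."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

def perturbed_gd(x0, lr=0.01, noise_std=15.0, steps=2000, seed=0):
    """Gradient descent with additive Gaussian perturbation of the gradient
    (SGD is the special case where the noise comes from mini-batch sampling).
    Returns the final iterate and the best iterate seen along the trajectory."""
    rng = np.random.default_rng(seed)
    x, best = x0, x0
    for _ in range(steps):
        g = grad_f(x) + noise_std * rng.standard_normal()
        x = x - lr * g
        if f(x) < f(best):
            best = x
    return x, best

if __name__ == "__main__":
    x0 = 3.0
    x_gd = plain_gd(x0)
    x_last, x_best = perturbed_gd(x0)
    print(f"plain GD     : x = {x_gd:+.3f}, f = {f(x_gd):+.3f}")
    print(f"perturbed GD : x = {x_last:+.3f}, f = {f(x_last):+.3f} (final)")
    print(f"               x = {x_best:+.3f}, f = {f(x_best):+.3f} (best seen)")
```

The final perturbed iterate fluctuates in a neighborhood of the global minimizer rather than sitting exactly at it, which is why the sketch also reports the best iterate encountered; this matches the "global minimum (or a neighborhood thereof)" guarantee stated in the abstract.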