Stochastic Gradient Descent (SGD) is routinely used for optimizing non-convex functions. Yet, the standard convergence theory for SGD in the smooth non-convex setting guarantees only slow sublinear convergence to a stationary point. In this work, we provide several convergence theorems for SGD showing convergence to a global minimum for non-convex problems satisfying some extra structural assumptions. In particular, we focus on two large classes of structured non-convex functions: (i) Quasar (Strongly) Convex functions (a generalization of convex functions) and (ii) functions satisfying the Polyak-Łojasiewicz condition (a generalization of strongly convex functions). Our analysis relies on an Expected Residual condition, which we show is strictly weaker than previously used growth conditions, expected smoothness, or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for several step-size selections, including constant, decreasing, and the recently proposed stochastic Polyak step-size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching and determine an optimal minibatch size. Finally, we show that for models that interpolate the training data, we can dispense with our Expected Residual condition and give state-of-the-art results in this setting.
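For reference, a minimal sketch of the two function classes and the step-size mentioned above, using standard formulations from the literature (the constants $\zeta$, $\mu$, and $c$ are illustrative and not necessarily those used in the paper): a function $f$ is quasar (strongly) convex with respect to a minimizer $x^*$ if, for some $\zeta \in (0,1]$ and $\mu \ge 0$,
\[ f(x^*) \;\ge\; f(x) + \tfrac{1}{\zeta}\langle \nabla f(x),\, x^* - x\rangle + \tfrac{\mu}{2}\|x^* - x\|^2 \quad \text{for all } x; \]
it satisfies the Polyak-Łojasiewicz (PL) condition with constant $\mu > 0$ if
\[ \tfrac{1}{2}\|\nabla f(x)\|^2 \;\ge\; \mu\bigl(f(x) - f^*\bigr) \quad \text{for all } x; \]
and the stochastic Polyak step-size for a sampled function $f_i$ at iterate $x^t$ takes the form
\[ \gamma_t \;=\; \frac{f_i(x^t) - f_i^*}{c\,\|\nabla f_i(x^t)\|^2}, \qquad c > 0. \]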