Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm at most $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$ of [18]. If $f(x)$ is nonconvex, to find an $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, whereas previous SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$ [6, 15, 33]. This is no slower than the best known stochastic version of Newton's method in all parameter regimes [30].
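For concreteness, the baseline the abstract contrasts against, plain SGD run on a convex stochastic objective and judged by the gradient norm of its output, can be sketched as follows. This is a minimal illustration only, assuming a synthetic least-squares objective, a $1/\sqrt{t}$ step size, and arbitrary problem dimensions; it is not an implementation of SGD3 or SGD5.

```python
# A minimal sketch of plain SGD (not the paper's SGD3 or SGD5) on a convex
# stochastic objective f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2, where
# each step uses one uniformly sampled index i to form the stochastic gradient.
# The problem size, step-size constant, and iteration count are illustrative
# assumptions, not values taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.1 * rng.standard_normal(n)   # noisy linear targets

x = np.zeros(d)
T = 20_000
for t in range(1, T + 1):
    i = rng.integers(n)                          # sample one component of f
    g = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient at x
    x -= 0.05 / np.sqrt(t) * g                   # decaying step size

grad = A.T @ (A @ x - b) / n                     # exact gradient of f at x
print("f(x):         ", 0.5 * np.mean((A @ x - b) ** 2))
print("||grad f(x)||:", np.linalg.norm(grad))
```

As the abstract notes, such a plain run can be rate-optimal for driving the function value down, yet it is not rate-optimal for driving the gradient norm down; SGD3 and SGD5 are designed to close that gap in the convex and nonconvex settings, respectively.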