We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient (SGD) methods, where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
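For concreteness, below is a minimal sketch of the AdaGrad-Norm update described above: a single scalar step size $\eta / b_t$, where $b_t^2$ accumulates the squared norms of all observed stochastic gradients. The function names, the constants `eta` and `b0`, and the toy noisy quadratic are illustrative assumptions, not part of the paper.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, T=1000):
    """Sketch of AdaGrad-Norm: one scalar step size adapted from observed
    stochastic gradient norms (no per-coordinate scaling)."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(T):
        g = grad_fn(x)               # stochastic gradient at the current iterate
        b_sq += np.dot(g, g)         # b_t^2 = b_{t-1}^2 + ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g
    return x

# Illustrative usage: a noisy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = adagrad_norm(noisy_grad, x0=np.ones(10), T=5000)
print(np.linalg.norm(x_final))
```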