SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, such as averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems, without projections onto bounded domains or knowledge of $T$. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are presented as well.
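The increasing-momentum, shrinking-update idea can be sketched in a few lines. The schedules below ($\beta_t = 1 - 1/t$ growing toward one and a stepsize decaying like $t^{-1/2}$) are illustrative assumptions in the spirit of the abstract, not the paper's exact FTRL-based choices; the test function $f(x) = |x - 1|$ with noisy subgradients is likewise just a toy example.

```python
import random

def sgdm(grad, x0, T, eta0=0.1):
    """SGD with momentum where beta_t increases toward 1 and the
    stepsize shrinks over time. These schedules are illustrative
    assumptions, not the paper's exact FTRL-based construction."""
    x, m = x0, 0.0
    for t in range(1, T + 1):
        beta = 1.0 - 1.0 / t        # momentum factor grows toward 1
        eta = eta0 / t ** 0.5       # shrinking stepsize
        m = beta * m + (1.0 - beta) * grad(x)
        x = x - eta * m
    return x                        # return the last iterate

# Noisy subgradient oracle for the toy convex function f(x) = |x - 1|.
random.seed(0)
def noisy_grad(x):
    g = 1.0 if x > 1.0 else -1.0
    return g + random.gauss(0.0, 0.1)

x_T = sgdm(noisy_grad, x0=0.0, T=5000)
```

Note that with $\beta_t = 1 - 1/t$ the momentum buffer is exactly the running average of all past stochastic subgradients, which is one way to see why such schedules can stabilize the last iterate without projections.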