Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic schedules that change the step size at each iteration. Momentum has also been widely employed in ML tasks to accelerate training. Yet there remains a gap in our theoretical understanding of these methods. In this work, we start to close this gap by providing formal guarantees for several heuristic optimization methods and by proposing improved algorithms. First, we analyze a generalized version of the AdaGrad step sizes (Delayed AdaGrad) in both the convex and non-convex settings, showing that these step sizes allow the algorithm to adapt automatically to the noise level of the stochastic gradients. We give the first sufficient conditions under which Delayed AdaGrad achieves almost sure convergence of the gradients to zero. Moreover, we present a high probability analysis for Delayed AdaGrad and its momentum variant in the non-convex setting. Second, we analyze SGD with exponential and cosine step sizes, which are empirically successful but lack theoretical support. We provide the first convergence guarantees for them in the smooth, non-convex setting, with and without the Polyak-{\L}ojasiewicz (PL) condition. We also show that, under the PL condition, these step sizes adapt to the noise level. Third, we study the last iterate of momentum methods. We prove the first lower bound in the convex setting for the last iterate of SGD with constant momentum. Moreover, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms with increasing momentum and shrinking updates, and show that their last iterate achieves the optimal convergence rate for unconstrained convex stochastic optimization problems.
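For concreteness, the step-size families mentioned above admit the following standard forms; the exact parameterizations analyzed in this work may differ, and the expressions below are only the common ones from the literature. Here $g_t$ denotes the stochastic gradient at iteration $t$, $T$ the total number of iterations, and $\alpha, b_0, \eta_0 > 0$ are tuning constants.
\[
\eta_t^{\mathrm{DelayedAdaGrad}} = \frac{\alpha}{\sqrt{b_0^2 + \sum_{i=1}^{t-1} \|g_i\|^2}}, \qquad
\eta_t^{\mathrm{exp}} = \eta_0 \, \alpha^{t}, \qquad
\eta_t^{\mathrm{cos}} = \frac{\eta_0}{2}\left(1 + \cos\frac{t\pi}{T}\right).
\]
Note that the Delayed AdaGrad step size uses only the gradients up to iteration $t-1$, which is what distinguishes it from the standard global AdaGrad step size that also includes $\|g_t\|^2$ in the sum.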