In this work, we propose new adaptive step size strategies that improve several stochastic gradient methods. Our first method (StoPS) is based on the classical Polyak step size (Polyak, 1987) and extends its recent adaptation to stochastic optimization, SPS (Loizou et al., 2021); our second method, denoted GraDS, rescales the step size by the "diversity of stochastic gradients". We provide a theoretical analysis of these methods for strongly convex smooth functions and show that they enjoy deterministic-like rates despite stochastic gradients. Furthermore, we demonstrate the theoretical superiority of our adaptive methods on quadratic objectives. Unfortunately, both StoPS and GraDS depend on unknown quantities, which makes them practical only for overparametrized models. To remedy this, we drop the undesired dependence and redefine StoPS and GraDS as StoP and GraD, respectively. We show that these new methods converge linearly to a neighbourhood of the optimal solution under the same assumptions. Finally, we corroborate our theoretical claims with experimental validation, which reveals that GraD is particularly useful for deep learning optimization.
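For concreteness, below is a minimal sketch of SGD with the stochastic Polyak step size (SPS) of Loizou et al. (2021), which StoPS builds on. It is not the authors' implementation: the interpolating least-squares problem, the constant `c`, and the cap `gamma_max` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                      # interpolation regime: every f_i attains its minimum f_i^* = 0 at x_star

def f_i(x, i):
    # component loss f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_i(x, i):
    # stochastic gradient of f_i
    return (A[i] @ x - b[i]) * A[i]

c, gamma_max = 0.5, 10.0            # assumed hyperparameters for this demo
x = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)
    g = grad_i(x, i)
    # SPS step size: (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), capped at gamma_max;
    # here f_i^* = 0 because the model interpolates the data.
    gamma = min(f_i(x, i) / (c * (g @ g) + 1e-12), gamma_max)
    x -= gamma * g

print("distance to solution:", np.linalg.norm(x - x_star))
```

The key design point the abstract refers to is that the step size adapts per sample using the suboptimality f_i(x) - f_i^*; StoPS and GraD modify how this (or the gradient-diversity rescaling) is estimated when such quantities are unknown.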