We study the application of variance reduction (VR) techniques to general non-convex stochastic optimization problems. In this setting, the recent work STORM [Cutkosky-Orabona '19] overcomes the drawback of having to compute gradients of "mega-batches" that earlier VR methods rely on. There, STORM utilizes recursive momentum to achieve the VR effect and was later made fully adaptive in STORM+ [Levy et al., '21], where full adaptivity removes the need to know certain problem-specific parameters, such as the smoothness of the objective and bounds on the variance and norm of the stochastic gradients, in order to set the step size. However, STORM+ crucially relies on the assumption that the function values are bounded, excluding a large class of useful functions. In this work, we propose META-STORM, a generalized framework extending STORM+ that removes this bounded function values assumption while still attaining the optimal convergence rate for non-convex optimization. META-STORM not only maintains full adaptivity, removing the need to obtain problem-specific parameters, but also improves the convergence rate's dependency on the problem parameters. Furthermore, META-STORM admits a wide range of parameter settings that subsumes previous methods, allowing for greater flexibility across settings. Finally, we demonstrate the effectiveness of META-STORM through experiments on common deep learning tasks. Our algorithm improves upon the previous work STORM+ and is competitive with widely used algorithms after the addition of per-coordinate updates and exponential moving average heuristics.
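For concreteness, the recursive momentum mechanism referenced above is, in the standard STORM recursion of [Cutkosky-Orabona '19], a gradient estimate $d_t$ updated as
\[
d_t = \nabla f(x_t; \xi_t) + (1 - a_t)\bigl(d_{t-1} - \nabla f(x_{t-1}; \xi_t)\bigr), \qquad x_{t+1} = x_t - \eta_t\, d_t,
\]
where $\xi_t$ is the fresh stochastic sample at step $t$ and $a_t \in [0,1]$ is the momentum parameter (the notation here is the usual one for STORM, not fixed by this abstract): $a_t = 1$ recovers plain SGD, while $a_t < 1$ adds the variance-reducing correction term $d_{t-1} - \nabla f(x_{t-1}; \xi_t)$ without requiring any mega-batch. The specific adaptive choices of $a_t$ and $\eta_t$ that define META-STORM are given in the body of the paper.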