The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization without requiring strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. On non-smooth convex benchmarks, our experiments are consistent with the theoretical predictions of how the safeguard affects the convergence neighborhood. On deep neural networks, the proposed step size achieves performance competitive with existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Moreover, in these experiments, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.
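For intuition only, the sketch below shows a stochastic subgradient step with a Polyak-type step size capped by an upper bound. This is not the paper's exact SPS$_{safe}$ rule: the loss lower-bound estimate `ell_star`, the cap `gamma_b`, and the scaling constant `c` are illustrative placeholders modeled on the classical capped SPS formulation.

```python
import numpy as np

def polyak_subgradient_step(x, loss_fn, subgrad_fn, batch,
                            ell_star=0.0, c=0.5, gamma_b=1.0):
    """One stochastic subgradient step with a capped Polyak-type step size.

    Illustrative sketch only; `ell_star`, `c`, and `gamma_b` are hypothetical
    placeholders, not the paper's SPS_safe parameters.
    """
    f_val = loss_fn(x, batch)      # mini-batch loss f_S(x)
    g = subgrad_fn(x, batch)       # a mini-batch subgradient of f_S at x
    g_sq = np.dot(g, g)
    if g_sq == 0.0:                # zero subgradient: no update on this batch
        return x
    # Polyak-type step size, capped by gamma_b so it cannot blow up when the
    # subgradient is tiny or the lower-bound estimate is loose.
    gamma = min((f_val - ell_star) / (c * g_sq), gamma_b)
    gamma = max(gamma, 0.0)        # guard against a negative step if f_val < ell_star
    return x - gamma * g
```

A momentum (heavy-ball) term could be folded into the update in the usual way; the paper's safeguarded rule and its momentum variant are specified precisely in the main text.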