This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions capture practical machine learning problems more closely than the pervasive $L_0$-smoothness assumption. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$, which is $(0,\mathcal{O}(L_1))$-smooth. Despite this richness, an emerging line of work achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence only when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings it is either not satisfied or leads to convergence rates with a worse dependence on the noise level. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability, even for smooth and strongly convex functions, when $\sigma_1 > 1$.
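The exponential example can be verified directly. The sketch below assumes the Hessian-based form of $(L_0,L_1)$-smoothness from Zhang et al., $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, which in one dimension reads $|f''(x)| \le L_0 + L_1 |f'(x)|$:

```latex
% Sketch: f(x) = e^{L_1 x} is (0, L_1)-smooth under the one-dimensional
% condition |f''(x)| <= L_0 + L_1 |f'(x)|.
\[
  f(x) = e^{L_1 x}, \qquad
  f'(x) = L_1 e^{L_1 x}, \qquad
  f''(x) = L_1^2 e^{L_1 x} = L_1 \, |f'(x)|,
\]
% so the bound holds with L_0 = 0. No uniform bound |f''(x)| <= L can hold,
% since f''(x) -> infinity as x -> infinity, so f is not L_0-smooth for any L_0.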