We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points with high probability at the best-known rates for smooth losses when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which to our knowledge has not been studied in this setting, and again obtain high-probability bounds for any $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis which requires a Hilbert space norm. Further, we show that after a suitable "burn-in" period, the objective value will monotonically decrease for every iteration until a critical point is identified, which provides intuition behind the popular practice of learning rate "warm-up" and also yields a last-iterate guarantee.
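As a rough sketch of the combination described above (with illustrative symbols of our own choosing: stochastic gradient $g_t$, clipping radius $\tau_t$, momentum parameter $\beta\in(0,1]$, and step size $\eta$; the precise parameter schedules and the treatment of general norms may differ from those analyzed in the paper), one way to instantiate clipping, momentum, and a normalized step is
\[
\hat g_t = g_t \min\!\left\{1,\ \frac{\tau_t}{\|g_t\|}\right\}, \qquad
m_t = (1-\beta)\, m_{t-1} + \beta\, \hat g_t, \qquad
x_{t+1} = x_t - \eta\, \frac{m_t}{\|m_t\|},
\]
i.e., each stochastic gradient is clipped to control heavy-tailed noise, the clipped gradients are averaged through momentum, and only the direction of the momentum vector is used for the update.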