Stochastic gradient algorithms are often unstable when applied to functions that do not have Lipschitz-continuous and/or bounded gradients. Gradient clipping is a simple and effective technique to stabilize the training process for problems that are prone to the exploding gradient problem. Despite its widespread popularity, the convergence properties of the gradient clipping heuristic are poorly understood, especially for stochastic problems. This paper establishes both qualitative and quantitative convergence results of the clipped stochastic (sub)gradient method (SGD) for non-smooth convex functions with rapidly growing subgradients. Our analyses show that clipping enhances the stability of SGD and that the clipped SGD algorithm enjoys finite convergence rates in many cases. We also study the convergence of a clipped method with momentum, which includes clipped SGD as a special case, for weakly convex problems under standard assumptions. With a novel Lyapunov analysis, we show that the proposed method achieves the best-known rate for the considered class of problems, demonstrating the effectiveness of clipped methods also in this regime. Numerical results confirm our theoretical developments.
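For concreteness, below is a minimal Python sketch of one clipped stochastic (sub)gradient step and of a momentum variant in the spirit of the methods discussed above. The function names, the (1 - beta) averaging, and the exact clipping rule are illustrative assumptions and need not match the precise parameterization analyzed in the paper.

    import numpy as np

    def clipped_sgd_step(x, g, step_size, clip_threshold):
        # Rescale the stochastic subgradient g so its norm never exceeds
        # clip_threshold, then take a (sub)gradient step from x.
        g_norm = np.linalg.norm(g)
        scale = min(1.0, clip_threshold / g_norm) if g_norm > 0 else 1.0
        return x - step_size * scale * g

    def clipped_momentum_step(x, m, g, step_size, clip_threshold, beta=0.9):
        # Momentum variant (illustrative parameterization): average the
        # stochastic subgradients into a buffer m, then clip the buffer.
        # With beta = 0 this reduces to the plain clipped SGD step above.
        m = beta * m + (1.0 - beta) * g
        m_norm = np.linalg.norm(m)
        scale = min(1.0, clip_threshold / m_norm) if m_norm > 0 else 1.0
        return x - step_size * scale * m, m

With beta = 0 the momentum update recovers clipped SGD, which is why the latter appears as a special case of the momentum method studied for weakly convex problems.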