Gradient clipping is commonly used in training deep neural networks, partly because of its effectiveness in mitigating the exploding gradient problem. Recently, \citet{zhang2019gradient} showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD by introducing a new assumption called $(L_0, L_1)$-smoothness, which characterizes the drastic fluctuations of gradients typically encountered in deep neural networks. However, their iteration complexities in terms of the problem-dependent parameters are rather pessimistic, and a theoretical justification of clipping combined with other crucial techniques, e.g., momentum acceleration, is still lacking. In this paper, we bridge this gap by presenting a general framework for studying clipping algorithms, which also takes momentum methods into consideration. We provide convergence analyses of the framework in both the deterministic and stochastic settings, and demonstrate the tightness of our results by comparing them with existing lower bounds. Our results imply that the efficiency of clipping methods does not degenerate even in highly non-smooth regions of the landscape. Experiments confirm the superiority of clipping-based methods in deep learning tasks.
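For reference, the assumption and the basic clipped update can be stated as follows; this is a minimal restatement in standard notation ($f$ is the objective, $\eta$ a step size, and $\gamma$ a clipping threshold), following the form used by \citet{zhang2019gradient}, whose exact parametrization may differ slightly:
% $(L_0,L_1)$-smoothness: the Hessian norm may grow with the gradient norm.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\| \qquad \text{for all } x,
\]
% Clipped GD: the effective step size shrinks once $\|\nabla f(x_k)\|$ exceeds $\gamma/\eta$.
\[
  x_{k+1} \;=\; x_k - \min\!\Big(\eta,\ \frac{\gamma}{\|\nabla f(x_k)\|}\Big)\,\nabla f(x_k).
\]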