Adaptive optimization methods have been widely used in deep learning. They scale the learning rate adaptively according to past gradients, which has been shown to be effective in accelerating convergence. However, they suffer from poorer generalization than SGD. Recent studies point out that the exponential smoothing of gradient noise leads to this generalization degradation. Inspired by this observation, we propose AdaL, which applies a transformation to the original gradient. AdaL accelerates convergence by amplifying the gradient in the early stage, and dampens oscillation and stabilizes the optimization by shrinking the gradient later. This modification alleviates the smoothing of gradient noise and thus yields better generalization. We theoretically prove the convergence of AdaL and demonstrate its effectiveness on several benchmarks.
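To make the amplify-early, shrink-later idea concrete, the following is a minimal sketch, not the paper's actual AdaL update rule: a time-dependent factor rescales the gradient before it feeds Adam-style moment estimates. The schedule and the hyperparameter names (amp, shrink, decay) are illustrative assumptions, not quantities defined in this work.

import numpy as np

def adal_like_step(theta, grad, state, t, lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-8,
                   amp=1.0, shrink=0.5, decay=1e-3):
    """One step on parameters `theta`; `state` holds running moments, `t` is 1-based."""
    beta1, beta2 = betas
    # Assumed time-dependent transformation: factor > 1 early (amplify),
    # < 1 later (dampen oscillation). Not the AdaL transformation itself.
    w = np.exp(-decay * t)
    phi = (1.0 + amp) * w + (1.0 - shrink) * (1.0 - w)
    g = phi * grad

    # Exponential moving averages on the transformed gradient, as in Adam.
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g

    # Bias correction and adaptive update.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage: minimize f(x) = x^2 starting from x = 5.
theta = np.array([5.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta)}
for t in range(1, 2001):
    grad = 2.0 * theta          # gradient of x^2
    theta = adal_like_step(theta, grad, state, t)
print(theta)                    # approaches 0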