Owing to their simplicity and efficiency, first-order gradient methods have been widely used in training neural networks. Although the optimization problem of a neural network is non-convex, recent research has proved that first-order methods can attain a global minimum when training over-parameterized neural networks, where the number of parameters is significantly larger than the number of training instances. Momentum methods, including the heavy ball method (HB) and Nesterov's accelerated gradient method (NAG), are the workhorse first-order gradient methods owing to their accelerated convergence. In practice, NAG often exhibits better performance than HB, yet existing research fails to distinguish their convergence behavior in training neural networks. Motivated by this, we provide a convergence analysis of HB and NAG in training an over-parameterized two-layer neural network with ReLU activation, through the lens of high-resolution dynamical systems and neural tangent kernel (NTK) theory. Compared to existing works, our analysis not only establishes tighter upper bounds on the convergence rates of both HB and NAG, but also characterizes the effect of the gradient correction term, which leads to the acceleration of NAG over HB. Finally, we validate our theoretical results on three benchmark datasets.
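For concreteness, here is a minimal sketch of the two iterations under comparison, written with a single sequence of iterates, step size $s$, and constant momentum $\beta$; this is one standard parameterization, and the exact scaling used in the paper may differ:

\begin{align*}
\text{HB:}\quad  & x_{k+1} = x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k), \\
\text{NAG:}\quad & x_{k+1} = x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k) - \beta s\,\bigl(\nabla f(x_k) - \nabla f(x_{k-1})\bigr).
\end{align*}

The final term in the NAG update is the gradient correction term: in the high-resolution ODE limit it roughly contributes an extra $\sqrt{s}\,\nabla^2 f(X)\,\dot{X}$ damping term that is absent from the HB dynamics, which is the effect the analysis uses to explain NAG's acceleration over HB.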