Nesterov的超计量神经网络加速梯度方法的可探测汇合 (Provable Convergence of Nesterov's Accelerated Gradient Method for Over-Parameterized Neural Networks)

Momentum methods, such as heavy ball method~(HB) and Nesterov's accelerated gradient method~(NAG), have been widely used in training neural networks by incorporating the history of gradients into the current updating process. In practice, they often provide improved performance over (stochastic) gradient descent~(GD) with faster convergence. Despite these empirical successes, theoretical understandings of their accelerated convergence rates are still lacking. Recently, some attempts have been made by analyzing the trajectories of gradient-based methods in an over-parameterized regime, where the number of the parameters is significantly larger than the number of the training instances. However, the majority of existing theoretical work is mainly concerned with GD and the established convergence result of NAG is inferior to HB and GD, which fails to explain the practical success of NAG. In this paper, we take a step towards closing this gap by analyzing NAG in training a randomly initialized over-parameterized two-layer fully connected neural network with ReLU activation. Despite the fact that the objective function is non-convex and non-smooth, we show that NAG converges to a global minimum at a non-asymptotic linear rate $(1-\Theta(1/\sqrt{\kappa}))^t$, where $\kappa > 1$ is the condition number of a gram matrix and $t$ is the number of the iterations. Compared to the convergence rate $(1-\Theta(1/{\kappa}))^t$ of GD, our result provides theoretical guarantees for the acceleration of NAG in neural network training. Furthermore, our findings suggest that NAG and HB have similar convergence rate. Finally, we conduct extensive experiments on six benchmark datasets to validate the correctness of our theoretical results.

翻译：(HB) 和 Nesterov 加速梯度法 (NAG) 等调制方法在神经网络培训中被广泛使用,将梯度历史纳入当前更新进程。实际上,这些方法往往能提高(沙沙) 梯度下游 ~ (GD) 的性能,并更快地趋同。尽管取得了这些经验性的成功,但对其加速趋同率的理论理解仍然不足。最近,一些尝试通过在过度参数化的系统中分析基于梯度的方法的轨迹来分析基于梯度的方法(HB) 加速梯度的方法 ~ (NAG) 加速度方法在神经网络中,参数的数量大大高于培训次数。然而,现有的理论工作主要涉及GD 和NAGA的既定趋同性结果低于HB 和GD 。在本文中,我们通过随机初始化的对数值过量的两层完全连接的神经网络进行测试, 与RELU 激活的结果是, 我们的客观函数是非对数值的理论值和不透化的纳基值水平。