Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and acceleration since the optimization landscape of neural networks is non-convex. Recently, some works have made progress towards understanding the convergence of momentum methods in the over-parameterized regime, where the number of parameters exceeds the number of training instances. Nonetheless, current results mainly focus on two-layer neural networks, which is far from explaining the remarkable success of momentum methods in training deep neural networks. Motivated by this, we investigate the convergence of NAG with a constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. In the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization. Our results show that NAG converges to the global minimum at a $(1 - \mathcal{O}(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ is a constant depending on the condition number of the feature matrix. Compared with the $(1 - \mathcal{O}(1/{\kappa}))^t$ rate of gradient descent~(GD), NAG achieves an acceleration over GD. To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks. Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.
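For concreteness, a minimal sketch of one standard parameterization of NAG with a constant learning rate $\eta$ and momentum parameter $\beta$ applied to a training loss $L$ (the precise form analyzed in this work may differ):
\[
\begin{aligned}
\mathbf{y}_t &= \boldsymbol{\theta}_t + \beta\,(\boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1}), \\
\boldsymbol{\theta}_{t+1} &= \mathbf{y}_t - \eta\,\nabla L(\mathbf{y}_t),
\end{aligned}
\]
where $\boldsymbol{\theta}_t$ collects the network parameters at iteration $t$. Under this scheme, the acceleration over GD is reflected in the per-iteration contraction factor improving from $1 - \mathcal{O}(1/\kappa)$ to $1 - \mathcal{O}(1/\sqrt{\kappa})$.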