Neural networks have achieved tremendous empirical success in many areas. It has been observed that a randomly initialized neural network trained by first-order methods is able to achieve near-zero training loss, even though its loss landscape is non-convex and non-smooth. There are few theoretical explanations for this phenomenon. Recently, some attempts have been made to bridge this gap between practice and theory by analyzing the trajectories of gradient descent~(GD) and the heavy-ball method~(HB) in an over-parameterized regime. In this work, we make further progress by considering Nesterov's accelerated gradient method~(NAG) with a constant momentum parameter. We analyze its convergence for an over-parameterized two-layer fully connected neural network with ReLU activation. Specifically, we prove that the training error of NAG converges to zero at a non-asymptotic linear rate $(1-\Theta(1/\sqrt{\kappa}))^t$ after $t$ iterations, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network. In addition, we compare NAG with the existing convergence results of GD and HB. Our theoretical results show that NAG achieves an acceleration over GD and that its convergence rate is comparable to that of HB. Finally, numerical experiments validate the correctness of our theoretical analysis.
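For concreteness, a minimal sketch of NAG with a constant momentum parameter, written in the standard form for a generic loss $L(\mathbf{w})$ (an illustrative formulation; the exact parameterization analyzed in this work may differ): starting from $\mathbf{v}_0 = \mathbf{w}_0$, the iterates follow
\begin{align*}
\mathbf{w}_{t+1} &= \mathbf{v}_t - \eta \nabla L(\mathbf{v}_t), \\
\mathbf{v}_{t+1} &= \mathbf{w}_{t+1} + \beta\,(\mathbf{w}_{t+1} - \mathbf{w}_t),
\end{align*}
where $\eta$ is the step size and $\beta \in (0,1)$ is held fixed across iterations rather than following the usual time-varying schedule; a constant choice of order $\beta = 1-\Theta(1/\sqrt{\kappa})$ is consistent with the $(1-\Theta(1/\sqrt{\kappa}))^t$ rate stated above.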