Despite the empirical success of deep learning, we still lack a theoretical understanding of why a randomly initialized neural network trained by first-order optimization methods is able to achieve zero training loss, even though its landscape is non-convex and non-smooth. Recently, several works have demystified this phenomenon in the over-parameterized regime. In this work, we make further progress in this area by considering a commonly used momentum optimization algorithm: the Nesterov accelerated gradient (NAG) method. We analyze the convergence of NAG for a two-layer fully connected neural network with ReLU activation. Specifically, we prove that the training error of NAG converges to zero at a linear rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network. Compared with the rate $1-\Theta(1/\kappa)$ of gradient descent, NAG achieves an acceleration. Besides, our result also shows that NAG and the heavy-ball method achieve a similar convergence rate.
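For reference, a minimal sketch of the standard two-sequence NAG iteration on a training loss $L$; the step size $\eta$ and momentum parameter $\beta$ are generic placeholders here, not the specific schedule used in the analysis:
$$
\begin{aligned}
\mathbf{v}_{t+1} &= \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t), \\
\mathbf{w}_{t+1} &= \mathbf{v}_{t+1} + \beta\,(\mathbf{v}_{t+1} - \mathbf{v}_t),
\end{aligned}
$$
where $\mathbf{w}_t$ are the network parameters at iteration $t$. On a $\kappa$-conditioned strongly convex quadratic, choosing $\beta = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ yields the classical $1-\Theta(1/\sqrt{\kappa})$ rate, which motivates the comparison with the $1-\Theta(1/\kappa)$ rate of gradient descent stated above.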