When equipped with efficient optimization algorithms, over-parameterized neural networks have demonstrated a high level of performance even though the loss function is non-convex and non-smooth. While many works have focused on understanding the loss dynamics of neural networks trained with gradient descent (GD), in this work we consider a broad class of optimization algorithms that are commonly used in practice. For example, we show from a dynamical systems perspective that the Heavy Ball (HB) method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to GD), whereas the Nesterov accelerated gradient method (NAG) may only converge to the global minimum sublinearly. Our results rely on the connection between the neural tangent kernel (NTK) and finite over-parameterized neural networks with ReLU activation, which leads us to analyze the limiting ordinary differential equations (ODEs) of the optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing a strongly convex loss over the prediction error. As a consequence, we can leverage classical convex optimization theory to understand the convergence behavior of neural networks. We believe our approach can also be extended to other optimization algorithms and network architectures.
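The following is a minimal sketch, not from the paper, of the kind of comparison the abstract describes: GD, Heavy Ball (HB), and Nesterov accelerated gradient (NAG) applied to a strongly convex quadratic loss over the prediction error, where the matrix `H` stands in for a fixed NTK Gram matrix and the step size and momentum coefficient are illustrative choices rather than the paper's tuned values.

```python
# Illustrative sketch (assumed setup): GD vs. HB vs. NAG on a strongly convex
# quadratic f(u) = 0.5 * u^T H u, a stand-in for the prediction-error dynamics
# induced by a fixed NTK Gram matrix H. Not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T / n + 0.1 * np.eye(n)   # positive-definite surrogate for the NTK Gram matrix
u0 = rng.standard_normal(n)          # initial prediction error

L = np.linalg.eigvalsh(H).max()      # smoothness constant of f
eta = 1.0 / L                        # step size (illustrative)
beta = 0.9                           # momentum coefficient (illustrative)

def loss(u):
    return 0.5 * u @ H @ u

def run(method, steps=200):
    u, u_prev = u0.copy(), u0.copy()
    history = []
    for _ in range(steps):
        if method == "gd":
            u_next = u - eta * (H @ u)
        elif method == "hb":   # Heavy Ball: gradient at the current iterate plus momentum
            u_next = u - eta * (H @ u) + beta * (u - u_prev)
        elif method == "nag":  # Nesterov: gradient evaluated at the extrapolated point
            v = u + beta * (u - u_prev)
            u_next = v - eta * (H @ v)
        u_prev, u = u, u_next
        history.append(loss(u))
    return history

for m in ["gd", "hb", "nag"]:
    print(f"{m}: final loss after 200 steps = {run(m)[-1]:.3e}")
```

Tracking the decay of the loss history for each method gives a rough, empirical sense of the rate differences discussed above; the paper's actual analysis works with the limiting ODEs of these algorithms rather than their discrete iterates.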