It is a well-known fact that nonconvex optimization is computationally intractable in the worst case. As a result, theoretical analysis of optimization algorithms such as gradient descent often focuses on local convergence to stationary points, where the gradient norm is zero or negligible. In this work, we examine the disconnect between the existing theoretical analysis of gradient-based algorithms and actual practice. Specifically, we provide numerical evidence that in large-scale neural network training, such as ImageNet + ResNet and WT103 + TransformerXL models, the neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes. Remarkably, however, we observe that while the weights do not converge to stationary points, the value of the loss function does converge. Inspired by this observation, we propose a new perspective based on the ergodic theory of dynamical systems. We prove convergence of the distribution of weight values to an approximate invariant measure (without smoothness assumptions), which explains this phenomenon. We further discuss how this perspective can better align the theory with empirical observations.
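To make the empirical claim concrete, the sketch below shows the kind of monitoring involved: it trains a toy MLP with constant-step-size minibatch SGD on synthetic data (the model, data, and hyperparameters are illustrative assumptions, not the paper's ImageNet or WT103 setups) and periodically logs the full-batch loss alongside the norm of the full-batch gradient at the current weights.

```python
# Minimal sketch (not the paper's experimental code): monitor whether the loss
# converges while the full-batch gradient norm at the current weights vanishes.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data standing in for a real dataset (illustrative assumption).
X = torch.randn(2048, 20)
y = torch.sin(X.sum(dim=1, keepdim=True)) + 0.1 * torch.randn(2048, 1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch_size = 64

def full_batch_stats(m):
    """Loss and gradient norm over the whole dataset at the current weights."""
    m.zero_grad()
    loss = loss_fn(m(X), y)
    loss.backward()
    gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in m.parameters()))
    m.zero_grad()
    return loss.item(), gnorm.item()

for epoch in range(1, 101):
    # One epoch of minibatch SGD with a constant step size.
    perm = torch.randperm(X.shape[0])
    for i in range(0, X.shape[0], batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()

    if epoch % 10 == 0:
        loss, gnorm = full_batch_stats(model)
        print(f"epoch {epoch:3d}  full-batch loss {loss:.4f}  ||grad|| {gnorm:.4f}")
```

The paper's observation corresponds to the loss column flattening while the gradient-norm column stays bounded away from zero; whether that happens in this toy setting depends on the learning rate and data, so the script only illustrates the measurement, not the reported result.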