This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101 and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, progress in minimizing the loss function halts and the training loss stabilizes. Inspired by this observation, we propose a new perspective based on the ergodic theory of dynamical systems to explain this phenomenon. Rather than studying the evolution of the weights themselves, we study the evolution of their distribution. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without the weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
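As a rough sketch of the ergodic-theoretic viewpoint (the notation below is illustrative and not taken verbatim from the abstract), stochastic gradient descent with i.i.d. mini-batches can be viewed as a Markov chain on the weights, and an invariant measure is a distribution that the chain's one-step transition kernel leaves unchanged:

\[
  w_{k+1} \;=\; w_k - \eta\,\nabla f(w_k;\xi_k), \qquad \xi_k \stackrel{\text{i.i.d.}}{\sim} \mathcal{D},
\]
\[
  \mu \text{ is invariant for the transition kernel } P
  \quad\Longleftrightarrow\quad
  \mu(A) \;=\; \int P(w, A)\,\mathrm{d}\mu(w) \quad \text{for every measurable set } A,
\]

where $P(w,A) = \Pr[\,w_{k+1}\in A \mid w_k = w\,]$. Under an (approximately) invariant $\mu$, the expected loss $\mathbb{E}_{w\sim\mu}[F(w)]$ is (approximately) constant across iterations, which is consistent with the training loss stabilizing even though the individual iterates $w_k$ keep moving and need not approach a stationary point.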