In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We find empirically that long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
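For orientation, a minimal sketch of the two central objects named above, written in generic form rather than with the paper's exact coefficients (the friction \(\gamma\) and temperature \(T\) below stand in for combinations of the learning rate, batch size, and gradient-noise covariance that the abstract does not spell out): the anomalous-diffusion observation, and the standard form of an underdamped Langevin equation.
\[
\|\theta_t - \theta_0\| \propto t^{\,c}, \qquad c \neq \tfrac{1}{2},
\]
\[
d\theta_t = v_t\, dt, \qquad
dv_t = -\gamma\, v_t\, dt - \nabla \mathcal{L}(\theta_t)\, dt + \sqrt{2\gamma T}\, dW_t,
\]
where \(\theta_t\) are the network parameters after \(t\) gradient updates, \(v_t\) their instantaneous velocities, \(\mathcal{L}\) the loss, and \(W_t\) a Wiener process. An exponent \(c = \tfrac{1}{2}\) would correspond to ordinary diffusion; the nontrivial exponent reported above is what makes the limiting dynamics anomalous.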