We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales and study the effect of learning rate, depth, and width of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late time ``edge of stability'' regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on learning rate $\eta \equiv c/\lambda^H_0$, depth $d$, and width $w$. We identify several critical values of $c$ which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness, and extract their dependence on $d/w$. Our results have implications for how to scale the learning rate with DNN depth and width in order to remain in the same phase of learning.
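As a minimal illustration of the quantities discussed above, the sketch below estimates the sharpness $\lambda^H$ (the largest Hessian eigenvalue of the training loss) by power iteration on Hessian-vector products in JAX, and sets the learning rate as $\eta = c/\lambda^H_0$. The toy model, the helper names (hvp, top_hessian_eigenvalue), and the value c = 2.0 are illustrative assumptions, not the paper's actual setup.

\begin{verbatim}
# Hypothetical sketch: estimating sharpness lambda^H via power iteration
# on Hessian-vector products, then choosing eta = c / lambda^H_0.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Toy differentiable model standing in for a DNN loss.
    pred = jnp.tanh(x @ params)
    return jnp.mean((pred - y) ** 2)

def hvp(params, x, y, v):
    # Hessian-vector product via forward-over-reverse differentiation.
    return jax.jvp(lambda p: jax.grad(loss)(p, x, y), (params,), (v,))[1]

def top_hessian_eigenvalue(params, x, y, num_iters=50,
                           key=jax.random.PRNGKey(0)):
    # Power iteration: repeatedly apply the Hessian to a random vector.
    v = jax.random.normal(key, params.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(num_iters):
        hv = hvp(params, x, y, v)
        v = hv / jnp.linalg.norm(hv)
    # Rayleigh quotient gives the eigenvalue estimate.
    return jnp.vdot(v, hvp(params, x, y, v))

key = jax.random.PRNGKey(1)
params = jax.random.normal(key, (8,))
x = jax.random.normal(key, (32, 8))
y = jnp.zeros((32,))

lam0 = top_hessian_eigenvalue(params, x, y)   # lambda^H_0 at initialization
c = 2.0                                       # dimensionless scale (assumption)
eta = c / lam0                                # learning rate eta = c / lambda^H_0
print(lam0, eta)
\end{verbatim}

Tracking this estimate over training steps gives the trajectory $\lambda^H_t$ used to distinguish the four regimes described above.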