The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent, however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent, we propose the principal flow (PF), a continuous-time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian, the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability, we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
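As a minimal illustrative sketch (not the principal flow itself), the snippet below shows why the standard continuous-time model, the gradient flow, cannot capture the instability described above: on a quadratic loss E(theta) = 0.5 * theta^T H theta, gradient descent oscillates and diverges along Hessian eigendirections where the step size h times the eigenvalue exceeds 2, while the gradient flow solution always decays. The Hessian, step size, and step count below are hypothetical choices for the demonstration.

```python
# Hedged sketch: gradient descent vs. gradient flow on a quadratic loss,
# illustrating divergence along a Hessian eigendirection with h * lambda > 2.
import numpy as np

H = np.diag([0.5, 3.0])            # diagonal Hessian with eigenvalues 0.5 and 3.0
h = 1.0                            # learning rate; h * 3.0 = 3 > 2 -> unstable direction
theta_gd = np.array([1.0, 1.0])    # initial parameters

for _ in range(10):
    theta_gd = theta_gd - h * (H @ theta_gd)   # discrete gradient descent update

# Gradient flow d(theta)/dt = -H theta has the closed form theta(t) = exp(-lambda * t) * theta(0);
# evaluate it at the comparable "time" t = 10 * h.
theta_flow = np.exp(-np.diag(H) * 10 * h) * np.array([1.0, 1.0])

print("gradient descent:", theta_gd)    # second coordinate blows up: (1 - 1.0 * 3.0)**10 = 1024
print("gradient flow:   ", theta_flow)  # both coordinates decay toward zero
```

The contrast motivates a continuous-time model that, like the PF, depends on the Hessian eigendecomposition and can therefore represent oscillatory and divergent trajectories.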