Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (the maximum eigenvalue of the Hessian) is often larger than the stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates but still converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a larger gap between the sharpness at the endpoint and $2/\eta$. In this paper, we study the EoS phenomenon by constructing a simple function that exhibits the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final convergent point has sharpness close to $2/\eta$. Globally, we observe that the training dynamics for our example exhibit an interesting bifurcating behavior, which has also been observed in the training of neural nets.
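For intuition on where the $2/\eta$ threshold comes from, the following is a minimal Python sketch (not the construction studied in this paper): on a quadratic the sharpness is constant, and gradient descent converges exactly when the sharpness stays below $2/\eta$; the EoS regime is where non-quadratic losses violate this condition yet still converge.

```python
import numpy as np

# Minimal sketch (not the paper's example): gradient descent on the 1-D
# quadratic f(x) = (lam / 2) * x^2, whose Hessian (sharpness) is the
# constant lam. The update x <- x - eta * lam * x = (1 - eta * lam) * x
# contracts iff |1 - eta * lam| < 1, i.e. iff lam < 2 / eta.

def gd_on_quadratic(lam, eta, x0=1.0, steps=50):
    """Run GD on f(x) = lam/2 * x^2 and return the iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] * (1.0 - eta * lam))
    return np.array(xs)

eta = 0.1  # stability threshold 2/eta = 20
for lam in (15.0, 19.9, 20.5):
    xs = gd_on_quadratic(lam, eta)
    print(f"sharpness {lam:5.1f}: |x_50| = {abs(xs[-1]):.3e}")
# Expected: oscillatory convergence for lam < 2/eta,
# divergence once the sharpness exceeds 2/eta.
```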