Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
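As a rough illustration of the regime the abstract describes, the sketch below runs full-batch gradient descent on a toy two-layer ReLU network with first-layer biases and tracks the sharpness (top eigenvalue of the loss Hessian) against the stability threshold 2/eta. Every specific here (the step-like target, the width m, the step size eta, the initialization) is an illustrative assumption, not the simplified model analyzed in the paper; whether the edge-of-stability oscillation and a non-zero learned bias actually emerge depends on these choices.

```python
# Illustrative sketch (not the paper's exact model): full-batch gradient
# descent on f(x) = sum_j a_j * relu(x + b_j), monitoring the sharpness
# (top Hessian eigenvalue). At the "edge of stability" the sharpness is
# expected to hover near 2/eta rather than stay below it.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
# Toy 1D regression data; fitting a step-like target requires a
# "threshold" neuron, i.e. a non-zero first-layer bias.
x = jnp.linspace(-2.0, 2.0, 32)
y = jnp.where(x > 0.5, 1.0, 0.0)            # step target at x = 0.5

m = 8                                        # hidden width (assumed)
a0 = 0.1 * jax.random.normal(key, (m,))      # second-layer weights
b0 = jnp.zeros(m)                            # first-layer biases start at zero
params0 = jnp.concatenate([a0, b0])

def loss(params):
    a, b = params[:m], params[m:]
    preds = jax.nn.relu(x[:, None] + b[None, :]) @ a
    return 0.5 * jnp.mean((preds - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))
hess_fn = jax.jit(jax.hessian(loss))

def sharpness(params):
    # Largest eigenvalue of the (small) full Hessian.
    return jnp.linalg.eigvalsh(hess_fn(params))[-1]

eta = 1.0                                    # deliberately large step size (assumed)
params = params0
for t in range(200):
    params = params - eta * grad_fn(params)
    if t % 20 == 0:
        print(f"step {t:3d}  loss {float(loss(params)):.4f}  "
              f"sharpness {float(sharpness(params)):.3f}  (2/eta = {2/eta:.3f})  "
              f"max |bias| {float(jnp.max(jnp.abs(params[m:]))):.3f}")
```

Re-running the loop with a much smaller eta (e.g. 0.01) gives one way to probe, empirically and only for this toy setup, the kind of step-size dependence the abstract refers to: comparing the learned biases across step sizes indicates whether small-step-size training leaves them near zero.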