We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.
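As a minimal, hedged sketch (not taken from the linked repository), the snippet below illustrates how one could monitor the quantity the abstract refers to: the sharpness, i.e. the maximum eigenvalue of the training loss Hessian, estimated by power iteration on Hessian-vector products and compared against the stability threshold $2 / \text{(step size)}$ during full-batch gradient descent. The helper name `sharpness`, the toy model, the data, and the step size `eta` are all illustrative assumptions, not part of the paper's setup.

```python
# Sketch: estimate the top Hessian eigenvalue ("sharpness") of the full-batch
# training loss and compare it to 2 / (step size). Illustrative only.
import torch

def sharpness(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products (double backprop)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        # Hessian-vector product: gradient of <grads, v> w.r.t. params.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    # Rayleigh quotient <v, Hv> with ||v|| = 1 approximates the top eigenvalue.
    return sum((a * b).sum() for a, b in zip(v, hv)).item()

# Toy full-batch gradient descent loop (hypothetical model and data).
eta = 0.01  # step size; stability threshold is 2 / eta
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(256, 10), torch.randn(256, 1)
params = [p for p in model.parameters()]

for step in range(1000):
    loss = torch.nn.functional.mse_loss(model(x), y)
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.4f}, "
              f"sharpness {sharpness(loss, params):.2f}, threshold {2 / eta:.2f}")
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g  # full-batch gradient descent step
```

At the Edge of Stability described above, one would expect the printed sharpness to rise during training and then hover just above the threshold `2 / eta`, while the loss decreases over long timescales despite short-timescale non-monotonicity.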