Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when it is run in a small step size, or ``stable'', regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the ``edge of stability'', in which the objective decreases non-monotonically, with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time, with power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time, with linear convergence to an orbit of period two centred on the optimally flat global minimum.
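As context for the step size regimes above (a standard fact recalled here, not a contribution of this paper): for a quadratic objective $f(w) = \tfrac{1}{2} w^{\top} H w$ with $H \succeq 0$, the GD iterates satisfy
\[
    w_{k+1} = w_k - \eta \nabla f(w_k) = (I - \eta H)\, w_k,
\]
so the objective decreases monotonically whenever $0 < \eta < 2/\lambda_{\max}(H)$. Informally, the edge of stability refers to training runs in which the sharpness $\lambda_{\max}(\nabla^2 f)$ sits at or above the corresponding threshold $2/\eta$.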