Gradient Descent (GD) is a powerful workhorse of modern machine learning, thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a ``bona-fide'' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold, inversely proportional to the aforementioned Lipschitz constant. Perhaps surprisingly, GD has been empirically observed to still converge despite local instability and oscillatory behaviour. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a ``Sharpness-Minimisation'' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterise a local condition involving third-order derivatives that stabilises oscillations of GD above the EoS, and leverage such a property in a teacher-student setting, under population loss. Finally, focusing on Matrix Factorization, we establish a non-asymptotic ``Local Implicit Bias'' of GD above the EoS, whereby quasi-symmetric initialisations converge to symmetric solutions, where sharpness is minimal amongst all minimisers.
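The stabilising role of third-order derivatives above the EoS can be illustrated on a toy one-dimensional problem (a hedged sketch, not the paper's setting: the losses `quad_grad` and `flat_grad` below are illustrative choices). For the quadratic f(x) = x^2/2 the sharpness is L = 1, so any step-size above 2/L = 2 diverges; for f(x) = sqrt(1 + x^2) - 1 the curvature decays away from the minimum, and GD with the same step-size settles into a bounded period-2 oscillation:

```python
import math

def gd(grad, x0, lr, steps):
    """Plain gradient descent, returning the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Quadratic loss f(x) = x^2 / 2: sharpness L = 1, so the classical
# admissibility threshold is 2/L = 2. Above it, GD diverges.
quad_grad = lambda x: x

# Loss with decaying curvature, f(x) = sqrt(1 + x^2) - 1: higher-order
# terms reduce sharpness away from the minimum, so GD above the EoS
# is trapped in a bounded oscillation instead of diverging.
flat_grad = lambda x: x / math.sqrt(1.0 + x * x)

lr = 2.5  # step-size above the EoS threshold of 2
x_quad = gd(quad_grad, 0.01, lr, 100)    # blows up exponentially
x_flat = gd(flat_grad, 0.01, lr, 1000)   # bounded 2-cycle at |x| = 0.75
```

For the second loss, a period-2 orbit x -> -x requires sqrt(1 + x^2) = lr/2 = 1.25, i.e. |x| = 0.75, and one can check the cycle is attracting, so the iterates oscillate there indefinitely rather than escaping.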