Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size exceeds the admissibility threshold that is inversely proportional to the aforementioned Lipschitz constant. Perhaps surprisingly, GD has been empirically observed to still converge despite this local instability. In this work, we study a local condition for such unstable convergence around a local minimum in a low-dimensional setting. We then leverage these insights to establish global convergence of a two-layer, single-neuron ReLU student network aligning with the teacher neuron, trained on the population loss with a large learning rate beyond the Edge of Stability. Finally, while gradient flow preserves the difference between the norms of the two layers, we show that GD above the Edge of Stability induces a balancing effect, leading to equal norms across the layers.
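To make the admissibility threshold concrete, the following minimal Python sketch (an illustration added here, not taken from the paper) runs GD on a one-dimensional quadratic loss whose gradient is lam-Lipschitz; the curvature lam and the two step-sizes are arbitrary choices sitting just below and just above the 2/lam threshold.

```python
# Minimal sketch (illustrative, not from the paper): for the quadratic loss
# L(theta) = (lam / 2) * theta**2, the gradient lam * theta is lam-Lipschitz and
# one GD step maps theta to (1 - eta * lam) * theta, so the iterates contract
# exactly when eta < 2 / lam, the admissibility threshold referred to above.
lam = 1.0
for eta in (1.9 / lam, 2.1 / lam):   # just below / just above the 2/lam threshold
    theta = 1.0
    for _ in range(60):
        theta -= eta * lam * theta   # gradient descent step on the quadratic
    print(f"eta = {eta:.2f}: |theta| after 60 steps = {abs(theta):.3e}")
```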
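The balancing claim in the final sentence can likewise be probed numerically. The sketch below is an assumption-laden illustration rather than the paper's exact setting: it replaces the population loss with an empirical squared loss over Gaussian samples (d = 5 and n = 4000 are arbitrary), picks a unit-norm teacher v, initialises near an unbalanced minimiser with a * w close to v but |a| different from ||w||, and compares a small step size (a proxy for gradient flow) against a step size chosen beyond the local Edge of Stability of that minimiser, reporting the final loss and the balance gap a^2 - ||w||^2 for both runs.

```python
import numpy as np

# Sketch of the single-neuron ReLU student-teacher setup (illustrative assumptions:
# empirical loss over Gaussian samples instead of the population loss, hand-picked
# dimensions and step sizes). Student: f(x) = a * relu(w . x); teacher: y = relu(v . x).
rng = np.random.default_rng(0)
d, n = 5, 4000
v = np.zeros(d)
v[0] = 1.0                                   # unit-norm teacher neuron
X = rng.standard_normal((n, d))              # Gaussian inputs
y = np.maximum(X @ v, 0.0)                   # teacher labels

for eta in (0.01, 1.2):                      # gradient-flow proxy vs. large step size
    a = 0.5                                  # start near an unbalanced minimiser:
    w = 2.0 * v + 0.01 * rng.standard_normal(d)   # a * w ~ v but a**2 != ||w||**2
    for t in range(3000):
        pre = X @ w
        act = np.maximum(pre, 0.0)
        resid = a * act - y                  # residuals of the loss 0.5 * mean(resid**2)
        grad_a = np.mean(resid * act)
        grad_w = X.T @ (resid * (pre > 0)) * (a / n)
        a -= eta * grad_a
        w -= eta * grad_w
    loss = 0.5 * np.mean((a * np.maximum(X @ w, 0.0) - y) ** 2)
    gap = a * a - w @ w                      # conserved by gradient flow; tracked here
    print(f"eta = {eta:4.2f}: final loss = {loss:.4f}, a^2 - ||w||^2 = {gap:+.3f}")
```

With the small step size the gap should remain close to its initial value, mirroring the conservation law of gradient flow, while the large-step run lets one observe how the gap evolves once the iterates become locally unstable; the specific step sizes and initialisation here are illustrative and do not reproduce the conditions of the paper's theorem.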