This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, this phenomenon, as we show, can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space and show that, with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution along the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates can be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already arises in classification tasks, without assuming any particular mismatch between the train and test data distributions.
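The core mechanism can be illustrated with a minimal sketch (not the paper's code): for gradient descent on a quadratic objective started at zero, each coefficient of the iterate along a Hessian eigenvector with eigenvalue lam_i is the target coefficient scaled by the spectral filter 1 - (1 - eta * lam_i)^t, so at a fixed early-stopping iteration t the learning rate eta governs how strongly each eigendirection is recovered. The eigenvalues, target coefficients, and iteration budget below are illustrative assumptions.

```python
import numpy as np

# Quadratic objective f(w) = 1/2 (w - w*)^T H (w - w*), worked out in the
# eigenbasis of H; gradient descent from w_0 = 0 admits a closed form per mode.
lam = np.array([1.0, 0.1, 0.01])      # assumed Hessian eigenvalues
w_star = np.array([1.0, 1.0, 1.0])    # assumed target coefficients in the eigenbasis
t = 50                                # early-stopping iteration budget (assumed)

for eta in [0.1, 1.0, 1.9]:           # step sizes below 2 / lam_max, so iterations stay stable
    # spectral filter applied by t steps of gradient descent to each eigencomponent
    filt = 1.0 - (1.0 - eta * lam) ** t
    w_t = filt * w_star
    print(f"eta={eta:>4}: recovered coefficients {np.round(w_t, 3)}")
```

With a small eta, only the large-eigenvalue directions are fit after t steps, whereas a large eta also recovers the small-eigenvalue directions, which is the dependence of the solution's spectral decomposition on the learning rate discussed above.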