Over-parameterization and adaptive methods have played a crucial role in the success of deep learning over the last decade. The widespread use of over-parameterization has forced us to rethink generalization by bringing forth new phenomena, such as the implicit regularization of optimization algorithms and double descent as training progresses. A series of recent works has begun to shed light on these areas in the quest to answer the question: why do neural networks generalize well? The setting of over-parameterized linear regression has provided key insights into this mysterious behavior of neural networks. In this paper, we characterize the performance of adaptive methods in the over-parameterized linear regression setting. We divide adaptive methods into two sub-classes according to their generalization behavior. For the first class, the parameter vector remains in the span of the data and, like gradient descent (GD), converges to the minimum-norm solution. For the second class, the gradient rotation caused by the pre-conditioner matrix yields an in-span component of the parameter vector that converges to the minimum-norm solution and an out-of-span component that saturates. Our experiments on over-parameterized linear regression and deep neural networks support this theory.
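The distinction between the two sub-classes can be illustrated numerically. Below is a minimal sketch, not the paper's experimental setup: plain GD on over-parameterized linear least squares keeps its iterates in the row span of the data matrix and reaches the minimum-norm interpolant, whereas a fixed diagonal pre-conditioner (used here as a simple stand-in for an adaptive pre-conditioner matrix) rotates the gradient and leaves a nonzero out-of-span component. The dimensions, step size, and pre-conditioner values are illustrative assumptions.

```python
# Sketch only: contrasts plain GD with a preconditioned update on
# over-parameterized linear regression (n < d). Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                  # over-parameterized regime: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution: w* = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

def run(preconditioner, steps=20000, lr=1e-3):
    """Iterate w <- w - lr * P * grad with grad = X^T (X w - y), starting at 0."""
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w = w - lr * preconditioner @ grad
    return w

w_gd = run(np.eye(d))                           # plain GD (P = I)
P = np.diag(rng.uniform(0.5, 2.0, size=d))      # fixed diagonal preconditioner (assumed)
w_pre = run(P)

# Projector onto the row span of X; the out-of-span part is what it leaves behind.
proj = X.T @ np.linalg.solve(X @ X.T, X)
def out_of_span_norm(w):
    return np.linalg.norm(w - proj @ w)

print("GD :  dist to min-norm = %.2e, out-of-span norm = %.2e"
      % (np.linalg.norm(w_gd - w_min_norm), out_of_span_norm(w_gd)))
print("Pre:  dist to min-norm = %.2e, out-of-span norm = %.2e"
      % (np.linalg.norm(w_pre - w_min_norm), out_of_span_norm(w_pre)))
```

With these assumed settings, GD shows a vanishing out-of-span norm and essentially zero distance to the minimum-norm solution, while the preconditioned run interpolates the data but retains a nonzero out-of-span component; the sketch only illustrates the in-span/out-of-span decomposition, not the full convergence results stated in the paper.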