There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression.
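The algorithm analyzed above can be sketched in a few lines: constant-stepsize SGD on the least-squares objective, with a running average of the iterates returned at the end. This is a minimal illustrative implementation, not the paper's code; the function name, step size, and data-generation choices are assumptions for the example.

```python
import numpy as np

def sgd_iterate_averaging(X, y, stepsize=0.01, w0=None, seed=0):
    """Constant-stepsize SGD for least squares, returning the average of
    the iterates (a minimal sketch; one stochastic step per sampled row)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d) if w0 is None else w0.astype(float).copy()
    w_bar = np.zeros(d)
    for t in range(n):
        i = rng.integers(n)                    # sample one observation
        grad = (X[i] @ w - y[i]) * X[i]        # gradient of 0.5 * (x_i' w - y_i)^2
        w = w - stepsize * grad                # constant-stepsize SGD update
        w_bar += (w - w_bar) / (t + 1)         # running average of iterates
    return w_bar
```

Note that no explicit regularization appears anywhere: the constant step size and the iterate averaging are the only sources of the implicit regularization the paper characterizes.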