There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging or tail averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors). For SGD with tail averaging, we show its advantage over SGD with iterate averaging by proving a better excess risk bound together with a nearly matching lower bound. Moreover, we reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression. Experimental results on synthetic data corroborate our theoretical findings.
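The following is a minimal, self-contained sketch (not taken from the paper; the spectrum, stepsize, noise level, and tail fraction are illustrative assumptions) of the kind of synthetic experiment the abstract refers to: one pass of constant-stepsize SGD on overparameterized linear regression, reporting the excess risk of the iterate average and the tail average, with the minimum-norm (OLS) interpolant as a point of comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overparameterized linear regression: n samples, d > n features.
# The covariance spectrum decays polynomially (an illustrative choice, not the paper's).
n, d = 100, 500
eigs = 1.0 / np.arange(1, d + 1) ** 2            # eigenvalues of the data covariance H
w_star = rng.normal(size=d) / np.sqrt(d)         # ground-truth parameter
X = rng.normal(size=(n, d)) * np.sqrt(eigs)      # rows x_i with covariance diag(eigs)
y = X @ w_star + 0.1 * rng.normal(size=n)        # noisy labels

def excess_risk(w):
    """Excess risk (w - w*)^T H (w - w*) under the diagonal covariance H."""
    return np.sum(eigs * (w - w_star) ** 2)

def sgd_averages(stepsize=0.1, tail_fraction=0.5):
    """One pass of constant-stepsize SGD from w_0 = 0; return the
    full-iterate (Polyak) average and the tail average of the iterates."""
    w = np.zeros(d)
    iterates = []
    for i in range(n):                           # single pass over the data
        grad = (X[i] @ w - y[i]) * X[i]          # stochastic gradient on example i
        w = w - stepsize * grad
        iterates.append(w.copy())
    iterates = np.array(iterates)
    full_avg = iterates.mean(axis=0)
    tail_avg = iterates[int(n * (1 - tail_fraction)):].mean(axis=0)
    return full_avg, tail_avg

# Minimum-norm interpolant (OLS in the overparameterized regime) for comparison.
w_mn = X.T @ np.linalg.solve(X @ X.T, y)

w_full, w_tail = sgd_averages()
print(f"SGD iterate avg: {excess_risk(w_full):.4f}  "
      f"SGD tail avg: {excess_risk(w_tail):.4f}  "
      f"min-norm OLS: {excess_risk(w_mn):.4f}")
```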