In this work we establish an algorithm- and distribution-independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- their training loss is close to the noise level -- or "modern" -- they have a much larger number of parameters than the minimum needed to fit the training data exactly. We also provide a more precise asymptotic analysis when the limiting spectral distribution of the whitened features is Marchenko-Pastur. Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, it coincides exactly with the distribution-independent bound as the level of overparametrization increases.
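To make the interpolation peak and the over/underparametrized regimes concrete, here is a minimal simulation sketch. It is not from the paper: the Gaussian data model, dimensions, and noise level are illustrative assumptions. It fits minimum-norm least-squares predictors using a growing number of features d and prints training and test loss; test loss typically spikes near d = n (the interpolation peak) and decreases again as overparametrization increases.

```python
# Illustrative sketch (not the paper's construction): double descent for
# minimum-norm least-squares linear predictors on i.i.d. Gaussian features.
import numpy as np

rng = np.random.default_rng(0)
n, d_total, sigma = 40, 200, 0.5              # samples, max features, noise level
w_true = rng.normal(size=d_total) / np.sqrt(d_total)   # assumed ground-truth model

X_train = rng.normal(size=(n, d_total))
y_train = X_train @ w_true + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, d_total))
y_test = X_test @ w_true + sigma * rng.normal(size=2000)

for d in [10, 20, 30, 38, 40, 42, 60, 100, 200]:
    # lstsq returns the minimum-norm solution when the system is underdetermined (d > n).
    w_hat, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
    train_loss = np.mean((X_train[:, :d] @ w_hat - y_train) ** 2)
    test_loss = np.mean((X_test[:, :d] @ w_hat - y_test) ** 2)
    print(f"d={d:4d}  train={train_loss:.3f}  test={test_loss:.3f}")
```

Under these assumptions, small-d models are "classical" (training loss stays near the noise level sigma^2), while models with d far above n are "modern" (training loss is driven to zero yet test loss remains controlled), with poor test loss concentrated near the interpolation threshold d = n.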