Modern machine learning models often employ a huge number of parameters and are typically optimized to zero training loss; yet, surprisingly, they achieve near-optimal prediction performance, contradicting classical learning theory. We examine how this benign overfitting phenomenon occurs in a two-layer neural network setting in which the sample covariates are corrupted with noise. We address the high-dimensional regime, where the data dimension $d$ grows with the number $n$ of data points. Our analysis combines an upper bound on the bias with matching upper and lower bounds on the variance of the interpolator (an estimator that interpolates the data). These results indicate that the excess learning risk of the interpolator decays under mild conditions. We further show that the two-layer ReLU network interpolator can achieve a near minimax-optimal learning rate, which to our knowledge is the first generalization result for such networks. Finally, our theory predicts that the excess learning risk starts to increase once the number of parameters $s$ grows beyond $O(n^2)$, matching recent empirical findings.
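To make the role of the bias and variance bounds explicit, the following is a minimal sketch of the standard bias-variance decomposition of the excess risk for squared loss, consistent with the description above (the notation $\hat f$, $f^*$, and $P_X$ is illustrative; the paper's precise definitions of the interpolator and the risk may differ):
$$
\mathbb{E}\big[\mathcal{R}(\hat f)\big] - \mathcal{R}(f^*)
\;=\;
\underbrace{\big\|\mathbb{E}[\hat f] - f^*\big\|_{L^2(P_X)}^2}_{\text{bias}}
\;+\;
\underbrace{\mathbb{E}\big\|\hat f - \mathbb{E}[\hat f]\big\|_{L^2(P_X)}^2}_{\text{variance}},
$$
where $f^*$ denotes the regression function and the expectation is taken over the training sample. Under this decomposition, an upper bound on the bias combined with matching upper and lower bounds on the variance yields two-sided control of the excess learning risk.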