In many modern applications of deep learning, the neural network has many more parameters than the number of data points used for its training. Motivated by this practice, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data while still achieving test error lower than the amount of noise in that data. arXiv:1906.11300 characterized the data covariance structures for which this phenomenon can occur in linear regression, when one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small codimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and they characterize not only how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
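For concreteness, here is a sketch of the estimators discussed above in our own notation, which is not taken from the abstract: the data matrix $X \in \mathbb{R}^{n \times p}$, the response vector $y \in \mathbb{R}^n$, and the ridge parameter $\lambda$. In the overparameterized regime with $XX^\top$ invertible, the minimum $\ell_2$-norm interpolator and the ridge estimator can be written as
$$\hat\theta_0 = \arg\min\{\|\theta\|_2 : X\theta = y\} = X^\top (XX^\top)^{-1} y, \qquad \hat\theta_\lambda = X^\top (XX^\top + \lambda I_n)^{-1} y,$$
and the negative-regularization result concerns taking $\lambda < 0$ in $\hat\theta_\lambda$. One common notion of effective rank of a covariance $\Sigma$ with eigenvalues $\mu_1 \ge \mu_2 \ge \dots$ is $r_k(\Sigma) = \sum_{i>k}\mu_i / \mu_{k+1}$; "high effective rank in a subspace of small codimension" then means that $r_k(\Sigma)$ is large for some small $k$.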