Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial-size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
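To make the claimed scaling explicit, the law of robustness can be paraphrased informally as follows (constants and logarithmic factors suppressed): given $n$ training points $(x_i, y_i)$ with covariates in dimension $d$ and label noise of variance $\sigma^2$, any function $f$ from a smoothly parametrized class with $p$ polynomially bounded parameters that fits the data below the noise level, $\frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2 < \sigma^2$, must with high probability satisfy
$$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}.$$
In particular, achieving an $O(1)$ Lipschitz constant requires $p \gtrsim nd$ parameters, whereas mere interpolation is already possible with $p$ on the order of $n$, which is the sense in which smooth interpolation needs $d$ times more parameters.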