Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
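To make the tradeoff concrete, the law of robustness can be summarized informally as follows (constants and logarithmic factors omitted; we assume $n$ noisy samples in dimension $d$ are fit below the noise level by a model with $p$ parameters): any such interpolating function $f$ from the parametrized class must satisfy
$$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}.$$
In particular, achieving a Lipschitz constant of order one requires $p \gtrsim nd$ parameters, whereas $p \approx n$ already suffices for mere (non-smooth) interpolation.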