Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj.
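For concreteness, the quantitative statement behind this law can be sketched as follows, with notation that is assumed here rather than defined in the abstract: $n$ denotes the number of data points, $p$ the number of parameters, and smoothness is measured by the Lipschitz constant. Up to constants and logarithmic factors, with high probability over the sampling of the data, any model $f$ in the class that fits the $n$ noisy labels must satisfy
$$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}},$$
so attaining an $O(1)$ Lipschitz constant (smooth interpolation) requires $p \gtrsim nd$ parameters, whereas mere interpolation is already possible with $p \approx n$ parameters.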