We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. We obtain similar results for other activation functions. For multivariate regression we show an analogous result, in which the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
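The univariate claim can be probed numerically. Below is a minimal sketch, assuming a shallow ReLU network $f(x)=\sum_k c_k\,\mathrm{relu}(w_k x + b_k)$ trained by full-batch gradient descent on squared error from a mirrored ("asymmetric") uniform initialization, and compared against the natural cubic spline interpolant of the same data. The data points, width, learning rate, step count, and the particular initialization distribution are illustrative choices, not the paper's exact experimental setup, so the agreement with the spline should be read as qualitative.

```python
# Sketch: compare a wide shallow ReLU network trained by gradient descent
# with the natural cubic spline interpolant of the training data.
# All hyperparameters below are illustrative assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)

# A handful of 1D training points.
x_train = np.array([-0.9, -0.5, -0.1, 0.3, 0.7, 1.0])
y_train = np.array([0.2, -0.4, 0.5, 0.1, -0.3, 0.4])
m = len(x_train)

n = 2000  # network width

# Mirrored ("asymmetric") initialization: each hidden unit is duplicated with
# the sign of its output weight flipped, so the network function is zero at
# initialization and the trained function equals its change from the start.
w = rng.uniform(-1.0, 1.0, n // 2)
b = rng.uniform(-1.0, 1.0, n // 2)
c = rng.choice([-1.0, 1.0], n // 2) / np.sqrt(n)
W1, B, W2 = np.concatenate([w, w]), np.concatenate([b, b]), np.concatenate([c, -c])

def net(x):
    # f(x) = relu(x W1 + B) @ W2, evaluated on a batch of inputs x.
    return np.maximum(np.outer(x, W1) + B, 0.0) @ W2

# Full-batch gradient descent on 0.5 * mean squared error, training all
# parameters (W1, B, W2).
lr, steps = 1e-3, 200_000
for _ in range(steps):
    pre = np.outer(x_train, W1) + B          # (m, n) pre-activations
    h = np.maximum(pre, 0.0)                 # hidden activations
    resid = (h @ W2 - y_train) / m           # scaled residuals
    dW2 = h.T @ resid                        # output-layer gradient
    dpre = np.outer(resid, W2) * (pre > 0)   # backprop through the ReLU
    W1 -= lr * (x_train @ dpre)
    B -= lr * dpre.sum(axis=0)
    W2 -= lr * dW2

# Natural cubic spline interpolant of the same data, compared on the data range
# (the stated result concerns interpolation, not extrapolation).
spline = CubicSpline(x_train, y_train, bc_type="natural")
grid = np.linspace(x_train.min(), x_train.max(), 200)
print("train RMSE:", np.sqrt(np.mean((net(x_train) - y_train) ** 2)))
print("max |network - natural cubic spline|:", np.max(np.abs(net(grid) - spline(grid))))
```

The printed training RMSE indicates how close gradient descent is to interpolating the data; per the abstract's statement, the gap to the natural cubic spline is expected to shrink as the width grows and training runs longer.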