We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
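The cubic spline claim can be probed numerically. The following sketch is not the paper's code: the width, step size, number of steps, and the uniform anti-symmetric initialization are illustrative assumptions and need not match the exact scheme for which the curvature penalty is constant, so the agreement with the natural cubic spline should be expected to be approximate rather than exact. It trains a wide shallow ReLU network (all parameters, full-batch gradient descent on squared error, zero initial function) on five univariate points and compares the learned function with the natural cubic spline interpolant of the same data computed by \texttt{scipy.interpolate.CubicSpline}.
\begin{verbatim}
# Minimal sketch (illustrative assumptions, not the paper's setup).
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
m_half = 1000                                  # half the width; total n = 2000
x = np.array([-1.0, -0.4, 0.1, 0.6, 1.0])      # training inputs
y = np.array([ 0.2,  0.9, 0.1, 0.7, 0.3])      # training targets

# Anti-symmetric initialization: duplicate each hidden unit with opposite
# output sign, so the network function is identically zero at initialization.
w0 = rng.uniform(-1.0, 1.0, m_half)
b0 = rng.uniform(-1.0, 1.0, m_half)
w = np.concatenate([w0, w0])
b = np.concatenate([b0, b0])
c = np.concatenate([np.ones(m_half), -np.ones(m_half)])
n = 2 * m_half

def forward(t):
    # f(t) = (1/sqrt(n)) * sum_k c_k * relu(w_k * t + b_k)
    pre = np.outer(t, w) + b                   # pre-activations, shape (len(t), n)
    return np.maximum(pre, 0.0) @ c / np.sqrt(n), pre

lr, steps = 0.05, 100_000
for _ in range(steps):                         # full-batch gradient descent on MSE
    out, pre = forward(x)
    r = out - y                                # residuals on the training data
    act = np.maximum(pre, 0.0)
    mask = (pre > 0.0).astype(float)
    c -= lr * act.T @ r / np.sqrt(n)
    w -= lr * c * (mask.T @ (r * x)) / np.sqrt(n)
    b -= lr * c * (mask.T @ r) / np.sqrt(n)

t = np.linspace(-1.0, 1.0, 201)
net, _ = forward(t)
spline = CubicSpline(x, y, bc_type='natural')(t)
print('max |network - natural cubic spline| =', np.abs(net - spline).max())
\end{verbatim}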