Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points (i.e., points where the tangent of the ReLU network estimator changes) between two consecutive training inputs is at most three. In particular, as the number of neurons in the network grows, the SGD dynamics are captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.
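To make the setting concrete, below is a minimal numerical sketch, not the paper's exact construction: it assumes the parameterization f(x) = (1/m) * sum_j a_j * relu(w_j * x + b_j), a per-neuron SGD step scaled by m (a common mean-field convention), and illustrative hyperparameters (step size, regularization strength, slope-change tolerance) chosen by us, not taken from the paper. The helper `predict` and the knot-detection heuristic are hypothetical names introduced here. After training on a toy univariate regression task, it numerically locates the "knot" points of the learned estimator and counts them between consecutive training inputs.

```python
import numpy as np

# Hypothetical sketch: wide two-layer ReLU network
#   f(x) = (1/m) * sum_j a_j * relu(w_j * x + b_j)
# trained by one-sample SGD with an L2 penalty, on a toy univariate
# regression task. Hyperparameters are illustrative, not from the paper.

rng = np.random.default_rng(0)

# Toy univariate training data
n = 8
x_train = np.sort(rng.uniform(-1.0, 1.0, size=n))
y_train = np.sin(np.pi * x_train)

# Two-layer ReLU network, mean-field style 1/m scaling
m = 1000                      # number of neurons (wide regime)
a = rng.normal(size=m)        # output weights
w = rng.normal(size=m)        # input weights
b = rng.normal(size=m)        # biases

def predict(x):
    """Evaluate f(x) = (1/m) * sum_j a_j * relu(w_j * x + b_j), vectorized over x."""
    pre = np.outer(np.atleast_1d(x), w) + b        # (batch, m) pre-activations
    return (np.maximum(pre, 0.0) @ a) / m

# One-sample SGD; the per-neuron step is scaled by m (assumed mean-field
# convention), and lam is the regularization strength.
eps, lam, steps = 0.05, 1e-2, 50_000
for _ in range(steps):
    i = rng.integers(n)
    x_i, y_i = x_train[i], y_train[i]
    pre = w * x_i + b
    act = np.maximum(pre, 0.0)
    err = act @ a / m - y_i
    on = (pre > 0.0).astype(float)                 # ReLU active-set indicator
    g_a = err * act + lam * a                      # per-neuron gradients
    g_w = err * a * on * x_i + lam * w
    g_b = err * a * on + lam * b
    a -= eps * g_a
    w -= eps * g_w
    b -= eps * g_b

# Locate knots: grid points where the numerical slope of the estimator changes.
xs = np.linspace(-1.0, 1.0, 4001)
ys = predict(xs)
slope = np.diff(ys) / np.diff(xs)
knots = xs[1:-1][np.abs(np.diff(slope)) > 1e-3]    # crude tolerance

# Count knots strictly between consecutive training inputs
for lo, hi in zip(x_train[:-1], x_train[1:]):
    k = int(np.sum((knots > lo) & (knots < hi)))
    print(f"({lo:+.2f}, {hi:+.2f}): {k} knot(s)")
```

The slope-change threshold is a crude heuristic: with m neurons, each individual ReLU kink contributes a slope change of order 1/m, so the tolerance is meant to pick out only the few knots with non-negligible slope change that survive at convergence, in the spirit of the result stated above.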