The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, two interesting phenomena are highlighted: a quantitative description of the initial alignment phase and a proof that the process follows a specific saddle-to-saddle dynamics.
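To make the setting concrete, the following is a minimal sketch (not the authors' code) of the regime studied here: a one-hidden-layer ReLU network f(x) = Σ_j a_j ReLU(⟨w_j, x⟩) trained on the mean squared error with orthogonal inputs and a small initialisation scale, with gradient flow approximated by gradient descent with a small step size. All names and hyper-parameter values below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

d, n, m = 8, 8, 32           # input dimension, number of samples, hidden width
scale = 1e-4                 # small initialisation scale
lr, steps = 1e-2, 200_000    # Euler step size and number of steps

key = jax.random.PRNGKey(0)
kw, ka, ky = jax.random.split(key, 3)

X = jnp.eye(d)[:n]                          # orthogonal (here orthonormal) inputs
y = jax.random.normal(ky, (n,))             # arbitrary targets
W = scale * jax.random.normal(kw, (m, d))   # hidden-layer weights, small init
a = scale * jax.random.normal(ka, (m,))     # output-layer weights, small init

def loss(params):
    W, a = params
    preds = jnp.maximum(X @ W.T, 0.0) @ a       # one-hidden-layer ReLU network
    return 0.5 * jnp.mean((preds - y) ** 2)     # mean squared error

def step(params, _):
    grads = jax.grad(loss)(params)
    new = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new, loss(new)

(W_final, a_final), losses = jax.lax.scan(step, (W, a), xs=None, length=steps)
print(f"final loss: {losses[-1]:.3e}")   # decreases towards zero after the alignment phase
```

With a small initialisation scale, the early iterates barely move the loss while the neurons rotate (the alignment phase described above), and only afterwards does the loss drop towards zero, which is the qualitative picture the article makes precise.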