Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Understanding the implicit bias of training algorithms is crucial to explaining the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous-time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation, and they help explain why stochastic gradient descent outperforms gradient descent in practice.
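
The comparison described above can be illustrated with a small numerical sketch. The abstract does not specify the parametrisation, so the code below assumes a common formulation of a diagonal linear network, beta = w_plus**2 - w_minus**2, trained on a noiseless sparse regression problem with a quadratic loss. All names, problem sizes, step sizes and the initialisation scale `alpha` are hypothetical choices made for illustration, and discrete-time GD and single-sample SGD stand in for the continuous-time flows studied in the paper.

```python
import numpy as np


def make_sparse_regression(n=20, d=50, seed=0):
    """Noiseless overparametrised regression (n < d) with a sparse target."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta_star = np.zeros(d)
    beta_star[:3] = [1.0, -1.0, 1.0]  # hypothetical 3-sparse ground truth
    return X, X @ beta_star, beta_star


def train_loss(X, y, beta):
    """Quadratic training loss; the 1/(2n) scaling is an assumption."""
    return 0.5 * np.mean((X @ beta - y) ** 2)


def train_diagonal_net(X, y, alpha=0.1, step=0.01, n_steps=20_000,
                       stochastic=False, seed=0):
    """Train beta = w_plus**2 - w_minus**2 with GD or single-sample SGD."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Equal initial weights, so the initial predictor beta is exactly zero.
    w_plus = np.full(d, alpha)
    w_minus = np.full(d, alpha)
    for _ in range(n_steps):
        beta = w_plus**2 - w_minus**2
        if stochastic:
            # Gradient of the loss on one uniformly sampled example.
            i = rng.integers(n)
            grad_beta = X[i] * (X[i] @ beta - y[i])
        else:
            # Full-batch gradient with respect to beta.
            grad_beta = X.T @ (X @ beta - y) / n
        # Chain rule: d(beta)/d(w_plus) = 2 w_plus, d(beta)/d(w_minus) = -2 w_minus.
        new_plus = w_plus - step * 2.0 * w_plus * grad_beta
        new_minus = w_minus + step * 2.0 * w_minus * grad_beta
        w_plus, w_minus = new_plus, new_minus
    return w_plus**2 - w_minus**2


if __name__ == "__main__":
    X, y, beta_star = make_sparse_regression()
    for name, flag in [("GD", False), ("SGD", True)]:
        beta = train_diagonal_net(X, y, stochastic=flag)
        print(name,
              "train loss:", train_loss(X, y, beta),
              "l1 norm:", np.abs(beta).sum(),
              "distance to beta_star:", np.linalg.norm(beta - beta_star))
```

Comparing the printed l1 norms and distances to `beta_star` gives a rough feel for which solution each method selects. A single small run of this kind is only an illustration and does not substitute for the theoretical results or the experiments reported in the paper.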
