We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.
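As an illustrative sketch of the classification result (with assumed notation not taken from the paper body: data points $x_i$ with labels $y_i \in \{\pm 1\}$, a map $S$ standing in for the network-defined transformation of the input space, and a linear predictor $z$), the $\ell_{2/L}$ max-margin problem in the transformed input space can be written as
$$\min_{z} \;\|z\|_{2/L} \quad \text{subject to} \quad y_i \,\langle S(x_i), z \rangle \ge 1 \quad \text{for all } i.$$
Under this reading, $L = 1$ recovers the standard $\ell_2$ hard-margin problem, while for $L \ge 2$ the $\ell_{2/L}$ quasi-norm favors sparser predictors in the transformed space.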