The implicit bias induced by the training of neural networks has become a topic of rigorous study. In the limit of gradient flow, and for gradient descent with an appropriate step size, it has been shown that training a deep linear network with the logistic or exponential loss on linearly separable data drives the weight matrices to converge in direction to rank-1 matrices. In this paper, we extend this theoretical result to the last few linear layers of the much wider class of nonlinear ReLU-activated feedforward networks containing fully-connected layers and skip connections. As in the linear case, the proof relies on specific local training invariances, sometimes referred to as alignment, which we show to hold for the submatrices whose neurons are stably activated on all training examples; this is consistent with empirical results in the literature. We also show that these invariances do not hold in general for the full weight matrix of a ReLU fully-connected layer. Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain directional convergence of the parameters.
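As a brief, hedged sketch of the known linear-network result being extended (the notation $f$, $W_j$, $u_j$, $v_j$, $L$ is introduced here for illustration and is not necessarily the paper's), consider gradient flow on a deep linear network $f(x) = W_L \cdots W_1 x$ with an exponentially-tailed loss on linearly separable data. Gradient flow preserves the layer-balancedness quantities
\[
\frac{d}{dt}\Bigl( W_{j+1}^{\top} W_{j+1} \;-\; W_j W_j^{\top} \Bigr) \;=\; 0,
\qquad j = 1, \dots, L-1,
\]
and, combined with the loss driving the margin to grow on separable data, each layer converges in direction to a rank-1 matrix with adjacent layers aligned (up to sign),
\[
\frac{W_j(t)}{\lVert W_j(t) \rVert_F} \;\longrightarrow\; u_j v_j^{\top},
\qquad v_{j+1} \;\longrightarrow\; u_j ,
\qquad t \to \infty .
\]
The present work establishes analogous invariances only for the stably-activated submatrices of the last few layers of ReLU networks, not for the full weight matrices.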