We study the conjectured relationship between the implicit regularization in neural networks trained with gradient-based methods and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and with vector-valued outputs), gradient flow (GF) with respect to the square loss acts as a rank-minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks, providing several new positive and negative results. On the negative side, we prove (and demonstrate empirically) that, unlike the linear case, GF on ReLU networks may no longer tend to minimize ranks, in a rather strong sense (even approximately, for "most" datasets of size 2). On the positive side, we reveal that ReLU networks of sufficient depth are provably biased towards low-rank solutions in several reasonable settings.
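For reference, a minimal sketch of the depth-2 linear setting invoked above (the notation here is illustrative and not necessarily the paper's own): the network computes
\[
f(x; W_1, W_2) = W_2 W_1 x,
\qquad
L(W_1, W_2) = \frac{1}{2} \sum_{i=1}^{n} \bigl\| W_2 W_1 x_i - y_i \bigr\|^2,
\]
and gradient flow evolves the weight matrices continuously along the negative gradient,
\[
\dot{W}_j(t) = -\nabla_{W_j} L\bigl(W_1(t), W_2(t)\bigr), \qquad j \in \{1, 2\},
\]
with the cited prior result concerning a bias of this flow towards low-rank weight matrices; the present paper asks whether an analogous bias persists once a ReLU nonlinearity is inserted.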