In this paper, we study the bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay induces a bias towards rank minimization of the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or stronger weight decay. Additionally, we predict and observe empirically that weight decay is necessary for this bias to occur. Finally, we empirically investigate the connection between this bias and generalization, finding that the low-rank bias has only a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.
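As a rough illustration of the kind of measurement the empirical claims refer to, the sketch below trains a small ReLU network with mini-batch SGD and weight decay and then reports the effective rank of each weight matrix. This is not the paper's experimental setup: the architecture, synthetic data, hyperparameters, and the `effective_rank` helper (counting singular values above a relative threshold) are illustrative assumptions only.

```python
# Illustrative sketch (not the paper's setup): observe the effect of
# mini-batch SGD + weight decay on the effective rank of weight matrices.
import torch
import torch.nn as nn

torch.manual_seed(0)

def effective_rank(W: torch.Tensor, tol: float = 1e-3) -> int:
    """Count singular values above tol * (largest singular value)."""
    s = torch.linalg.svdvals(W)  # singular values in descending order
    return int((s > tol * s[0]).sum())

# Small ReLU network; the analysis applies to any width or depth.
model = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Mini-batch SGD with weight decay (the training regime discussed above).
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for a real training set.
X = torch.randn(2048, 64)
y = torch.randint(0, 10, (2048,))

batch_size = 16  # smaller batches are predicted to strengthen the bias
for epoch in range(50):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()

# Report the effective rank of each weight matrix after training.
for name, p in model.named_parameters():
    if p.ndim == 2:
        print(f"{name}: effective rank {effective_rank(p.detach())} / {min(p.shape)}")
```

Under this kind of setup, varying the batch size, learning rate, or weight decay coefficient and re-running the measurement is one simple way to probe the trends described in the abstract.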