We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We study the source of SGD noise and prove that when training with weight decay, the only solutions of SGD at convergence are zero functions. Furthermore, we show, both theoretically and empirically, that when a neural network is trained using SGD with weight decay and a small batch size, the resulting weight matrices are expected to be of low rank. Our analysis relies on a minimal set of assumptions: the neural networks may be arbitrarily wide or deep and may include residual connections as well as batch normalization layers.
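As a rough empirical companion to the low-rank claim, the following sketch (not from the paper; the architecture, data, learning rate, weight-decay coefficient, batch size, and rank threshold are all illustrative assumptions) trains a small ReLU network with mini-batch SGD and weight decay, then counts the singular values of a hidden weight matrix that exceed a small fraction of the largest one:

```python
# Minimal sketch: mini-batch SGD + weight decay on a ReLU MLP,
# then an effective-rank measurement of a hidden weight matrix.
# All hyperparameters below are arbitrary illustrative choices.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data.
X = torch.randn(512, 20)
y = torch.randn(512, 1)

model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Small batch size and nonzero weight decay, matching the setting studied.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=5e-3)
loss_fn = nn.MSELoss()

for step in range(20_000):
    idx = torch.randint(0, X.size(0), (8,))  # mini-batch of size 8
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

# Effective rank: number of singular values above a small fraction
# (here 1%, an arbitrary threshold) of the largest singular value.
W = model[2].weight.detach()                 # hidden-layer weight matrix
s = torch.linalg.svdvals(W)
eff_rank = (s > 1e-2 * s[0]).sum().item()
print(f"shape {tuple(W.shape)}, effective rank {eff_rank}")
```

Under these assumptions, rerunning with a larger batch size or with `weight_decay=0` provides a simple point of comparison for the reported effect of small batches and weight decay on the rank of the learned matrices.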