We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when a neural network is trained using SGD with weight decay and a small batch size, the resulting weight matrices tend to be of low rank. Our analysis relies on a minimal set of assumptions: the networks may be arbitrarily wide or deep and may include residual connections as well as convolutional layers. The same analysis implies the inherent presence of SGD "noise", defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples.
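The low-rank phenomenon described above can be checked empirically. The following is a minimal illustrative sketch, not the paper's experimental setup: it trains a small ReLU MLP on a synthetic regression task (an assumption, as are all hyperparameter values) with mini-batch SGD plus weight decay, then reports an effective rank for each weight matrix by counting singular values above a relative threshold.

```python
# Minimal sketch (assumed setup, not the paper's code): small-batch SGD with
# weight decay on a ReLU MLP, followed by an effective-rank readout of the
# weight matrices.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (assumption; any dataset illustrates the point).
X = torch.randn(512, 32)
y = torch.randn(512, 1)

model = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

# Small batch size + weight decay: the regime the abstract describes.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=5e-3)
loss_fn = nn.MSELoss()
batch_size = 8

for step in range(20_000):
    idx = torch.randint(0, X.shape[0], (batch_size,))
    opt.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx])
    loss.backward()
    opt.step()

def effective_rank(W: torch.Tensor, tol: float = 1e-2) -> int:
    # Count singular values above tol * (largest singular value).
    s = torch.linalg.svdvals(W)
    return int((s > tol * s[0]).sum())

for name, p in model.named_parameters():
    if p.ndim == 2:  # inspect weight matrices only, skip biases
        print(f"{name}: shape={tuple(p.shape)}, effective rank={effective_rank(p)}")
```

With weight decay enabled and a small batch, the hidden-layer matrices typically show effective ranks far below their full dimension; rerunning with `weight_decay=0.0` or a full-batch update gives a useful contrast under these assumptions.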