The training of deep neural networks (DNNs) is currently predominantly done using first-order methods. Some of these methods (e.g., Adam, AdaGrad, RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs by preconditioning the stochastic gradient with layer-wise block-diagonal matrices. Here we propose, and analyze the convergence of, an approximate natural gradient method, mini-block Fisher (MBF), that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than that of first-order methods. Finally, the performance of our proposed method is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness in terms of both time efficiency and generalization power.
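To illustrate the idea of mini-block preconditioning described above, the following is a minimal sketch for a single fully connected layer. It assumes, purely for illustration, that each mini-block corresponds to one neuron's row of the weight gradient, that the per-mini-block Fisher approximations are tracked with an exponential moving average, and that a damping term is added before inversion; the class name MiniBlockPreconditioner, the decay and damping values, and these partitioning details are assumptions, not the paper's exact algorithm. The batched solve over all mini-blocks is what lets a GPU process the many small matrices in one layer in parallel.

```python
# Hypothetical sketch of an MBF-style mini-block preconditioner for one fully
# connected layer (not the authors' implementation). Assumptions: mini-blocks are
# the per-neuron rows of the weight gradient, Fisher mini-blocks are tracked with
# an exponential moving average, and damping is added before the batched solve.
import torch


class MiniBlockPreconditioner:
    def __init__(self, out_features, in_features, ema_decay=0.95, damping=1e-2):
        self.ema_decay = ema_decay
        self.damping = damping
        # One small (in_features x in_features) Fisher block per output neuron,
        # stored as a batched tensor so the GPU can process all blocks at once.
        self.fisher_blocks = torch.zeros(out_features, in_features, in_features)

    def update(self, grad_weight):
        # grad_weight: (out_features, in_features) stochastic gradient of the layer.
        # Each row is one mini-block gradient g_b; accumulate g_b g_b^T per block.
        outer = grad_weight.unsqueeze(2) @ grad_weight.unsqueeze(1)
        self.fisher_blocks.mul_(self.ema_decay).add_(outer, alpha=1 - self.ema_decay)

    def precondition(self, grad_weight):
        # Solve (F_b + damping * I) x_b = g_b for every mini-block in one batched call.
        eye = torch.eye(grad_weight.shape[1], device=grad_weight.device)
        damped = self.fisher_blocks + self.damping * eye
        return torch.linalg.solve(damped, grad_weight.unsqueeze(2)).squeeze(2)


# Usage: one SGD-like step on a single linear layer with mini-block preconditioning.
layer = torch.nn.Linear(8, 4)
precond = MiniBlockPreconditioner(out_features=4, in_features=8)
x, y = torch.randn(16, 8), torch.randn(16, 4)
loss = torch.nn.functional.mse_loss(layer(x), y)
loss.backward()
precond.update(layer.weight.grad)
with torch.no_grad():
    layer.weight -= 0.01 * precond.precondition(layer.weight.grad)
```

Because each mini-block is of modest size, storing and inverting all of them costs little more per iteration than a first-order update, which is the trade-off the abstract highlights.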