Deep neural networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, which precondition the stochastic gradient with layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than that of first-order methods. The performance of our proposed method is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness in terms of both time efficiency and generalization power. Finally, it is proved that an idealized version of MBF converges linearly.
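To illustrate the core idea of preconditioning with many small per-layer blocks, the following is a minimal NumPy sketch. It assumes a layer's parameters are split into contiguous, equally sized mini-blocks and uses a plain Tikhonov damping term; the function name `mbf_precondition` and this particular splitting rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mbf_precondition(per_sample_grads, block_size, damping=1e-2):
    """Sketch of a mini-block empirical-Fisher preconditioner for one layer.

    per_sample_grads: array of shape (n_samples, n_params) holding per-sample
    gradients; the parameters are split into contiguous mini-blocks of size
    `block_size` (an assumption for illustration).
    Returns the preconditioned mini-batch gradient, flattened.
    """
    n, p = per_sample_grads.shape
    assert p % block_size == 0, "assume the block size divides the parameter count"
    n_blocks = p // block_size

    # Group each mini-block's per-sample gradients: (n_blocks, n_samples, block_size).
    g = per_sample_grads.reshape(n, n_blocks, block_size).transpose(1, 0, 2)

    # Empirical Fisher of each mini-block: average outer product g g^T,
    # computed for all blocks at once (a batched matmul, as a GPU would do).
    fisher = np.einsum('bni,bnj->bij', g, g) / n            # (n_blocks, d, d)
    fisher += damping * np.eye(block_size)[None, :, :]      # Tikhonov damping

    # Mini-batch gradient per block, then solve F_b x_b = grad_b block by block.
    grad = g.mean(axis=1)                                    # (n_blocks, d)
    precond = np.linalg.solve(fisher, grad[..., None])[..., 0]
    return precond.reshape(p)

# Tiny usage example with random per-sample gradients.
rng = np.random.default_rng(0)
step = mbf_precondition(rng.standard_normal((32, 12)), block_size=4)
print(step.shape)  # (12,)
```

Because every mini-block Fisher matrix is small, the per-block solves are cheap, and batching them (as in the `einsum` and batched `solve` above) is what keeps the per-iteration cost close to that of a first-order method.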