Despite the predominant use of first-order methods for training deep learning models, second-order methods, and in particular natural gradient methods, remain of interest because of their potential to accelerate training through the use of curvature information. Several methods with non-diagonal preconditioning matrices, including KFAC and Shampoo, have been proposed and shown to be effective. Based on the so-called tensor normal (TN) distribution, we propose and analyze a new approximate natural gradient method, Tensor Normal Training (TNT), which, like Shampoo, requires knowledge only of the shape of the training parameters. By approximating the probabilistically based Fisher matrix, as opposed to the empirical Fisher matrix, our method uses the layer-wise covariance of the sampling-based gradient as the preconditioning matrix. Moreover, the assumption that the sampling-based (tensor) gradient follows a TN distribution ensures that its covariance has a Kronecker-separable structure, which leads to a tractable approximation to the Fisher matrix. Consequently, TNT's memory requirements and per-iteration computational costs are only slightly higher than those for first-order methods. In our experiments, TNT exhibited superior optimization performance to KFAC and Shampoo, and to state-of-the-art first-order methods. TNT also generalized as well as these first-order methods while using fewer epochs.
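To illustrate why a Kronecker-separable covariance keeps the preconditioner tractable, the following is a minimal NumPy sketch for a single matrix-shaped (two-way) parameter. It is not the TNT algorithm itself: the function names, the random factors `U` and `V`, and the `damping` term are assumptions made for illustration, whereas TNT estimates its factors from sampling-based gradients. The sketch only checks the standard identity that inverting `U ⊗ V` against the flattened gradient is equivalent to two small solves, one per mode.

```python
import numpy as np

def kron_precondition(grad, U, V, damping=1e-3):
    """Apply the inverse of (U + dI) kron (V + dI) to grad using two small solves.

    grad: (m, n) gradient; U: (m, m) and V: (n, n) symmetric PSD factors.
    The damping value is an illustrative assumption, not taken from the paper.
    """
    m, n = grad.shape
    U_d = U + damping * np.eye(m)
    V_d = V + damping * np.eye(n)
    X = np.linalg.solve(U_d, grad)          # U_d^{-1} @ grad
    return np.linalg.solve(V_d.T, X.T).T    # X @ V_d^{-1}

def dense_precondition(grad, U, V, damping=1e-3):
    """Reference: form the full (mn x mn) Kronecker product and solve directly."""
    m, n = grad.shape
    F = np.kron(U + damping * np.eye(m), V + damping * np.eye(n))
    # Row-major flattening matches np.kron's block layout for symmetric V.
    return np.linalg.solve(F, grad.ravel()).reshape(m, n)

# Hypothetical factors: random symmetric positive definite matrices.
rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))
U = A @ A.T + np.eye(m)
V = B @ B.T + np.eye(n)
G = rng.standard_normal((m, n))

fast = kron_precondition(G, U, V)
slow = dense_precondition(G, U, V)
print(np.allclose(fast, slow))  # the two routes should agree numerically
```

The factored route stores only an m × m and an n × n matrix and never forms the mn × mn product, which is the kind of saving that keeps memory and per-iteration cost close to those of first-order methods.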