Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful family of approximations consists of Kronecker-factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations and establish a surprising result: due to its approximations, KFAC is not closely related to second-order updates, and in particular it significantly outperforms true second-order updates. This challenges widely held beliefs and immediately raises the question of why KFAC performs so well. Towards answering this question, we present evidence strongly suggesting that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data efficiency.
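To make the contrast between the two updates concrete, the following is a minimal layer-wise sketch; the notation ($A$, $G$, $\lambda$, and the placement of the damping terms) is assumed here for illustration and is not taken verbatim from the abstract. Consider a fully connected layer with weights $W$, inputs $a$, pre-activations $s = W a$, and backpropagated gradients $g = \partial \mathcal{L} / \partial s$.

% Hedged sketch: KFAC's Kronecker-factored step vs. gradient descent on neurons,
% which drops the output-gradient factor and preconditions only by the input covariance.
\begin{align*}
  A &= \mathbb{E}\!\left[a a^{\top}\right], \qquad
  G = \mathbb{E}\!\left[g g^{\top}\right], \\[4pt]
  \Delta W_{\text{KFAC}} &\propto (G + \lambda I)^{-1}\, \nabla_W \mathcal{L}\, (A + \lambda I)^{-1}, \\[4pt]
  \Delta W_{\text{neuron}} &\propto \nabla_W \mathcal{L}\, (A + \lambda I)^{-1}.
\end{align*}

Under this reading, the evidence in the paper suggests that KFAC's behaviour is driven mostly by the $A^{-1}$ factor, which is exactly what the neuron-wise first-order update retains.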