The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics, such as learning rate grafting and stale preconditioning, to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics through the lens of the Frobenius-norm approximation to full-matrix Adam and decouples the updates of the preconditioner's eigenvalues and eigenbasis. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues, and that correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency, motivated by the termination condition of a warm-started QR algorithm. This criterion decouples the update frequencies of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled approach to removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
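For concreteness, here is a minimal sketch of the decoupling referred to above, written in our own notation (not necessarily the paper's); it uses the original Shampoo accumulation rule, whereas practical variants may use exponential moving averages. For a layer with weight matrix $W_t$ and gradient $G_t \in \mathbb{R}^{m \times n}$, Shampoo maintains Kronecker factors and preconditions the update as
\begin{align*}
L_t &= L_{t-1} + G_t G_t^\top, & R_t &= R_{t-1} + G_t^\top G_t, & W_{t+1} &= W_t - \eta_t\, L_t^{-1/4} G_t R_t^{-1/4}.
\end{align*}
Writing the eigendecompositions $L_t = Q_L \Lambda_L Q_L^\top$ and $R_t = Q_R \Lambda_R Q_R^\top$, the combined preconditioner $L_t^{1/2} \otimes R_t^{1/2}$ factors as $(Q_L \otimes Q_R)\,(\Lambda_L^{1/2} \otimes \Lambda_R^{1/2})\,(Q_L \otimes Q_R)^\top$, so its eigenbasis $Q_L \otimes Q_R$ (expensive to recompute) and its eigenvalues $\Lambda_L^{1/2} \otimes \Lambda_R^{1/2}$ (cheap to update) can be refreshed at different frequencies.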