Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.
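To make the central idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of preconditioning one fully-connected layer's gradient in a Kronecker-factored eigenbasis: the two KFAC factors are assumed to be `A` (covariance of the layer's inputs) and `G` (covariance of back-propagated output gradients), and `grads` is assumed to hold per-example weight gradients.

```python
import numpy as np

def ekfac_style_precondition(A, G, grads, eps=1e-3):
    """Hypothetical sketch: rescale a gradient by a diagonal variance
    tracked in the Kronecker-factored eigenbasis (assumed shapes:
    A is in_dim x in_dim, G is out_dim x out_dim,
    grads is n_examples x out_dim x in_dim)."""
    # Eigenbases of the two Kronecker factors.
    _, U_A = np.linalg.eigh(A)   # in_dim  x in_dim
    _, U_G = np.linalg.eigh(G)   # out_dim x out_dim

    # Project per-example gradients into the Kronecker-factored eigenbasis
    # and estimate a diagonal second moment (variance) there.
    projected = np.einsum('oi,nij,jk->nok', U_G.T, grads, U_A)
    s2 = (projected ** 2).mean(axis=0)

    # Precondition the mean gradient: rotate into the eigenbasis, rescale
    # by the tracked diagonal, rotate back to parameter coordinates.
    g = grads.mean(axis=0)
    g_kfe = U_G.T @ g @ U_A
    g_kfe /= (s2 + eps)
    return U_G @ g_kfe @ U_A.T
```

In this sketch the diagonal scaling `s2` plays the role of the tracked variance; in practice it would be maintained as a running estimate across minibatches rather than recomputed from a single batch.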