K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, suggesting that substantial computation can be saved by focusing only on the first few eigen-modes when inverting the Kronecker factors. Importantly, the spectrum decay happens over a constant number of modes irrespective of the layer width. This allows us to reduce the time complexity of K-FAC from cubic to quadratic in layer width, partially closing the gap w.r.t. SENG (another practical Natural Gradient implementation for Deep Learning which scales linearly in width). Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain a $\approx2.5\times$ reduction in per-epoch time and a $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed sped-up K-FAC versions with SENG, and observe that for CIFAR10 classification with VGG16_bn we perform on par with it.
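To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of how Randomized Numerical Linear Algebra can approximate the leading eigen-modes of a symmetric PSD Kronecker factor and then apply a damped inverse using only that low-rank approximation. The function names `truncated_eig_randomized` and `apply_inverse_lowrank`, the oversampling parameter, and the treatment of the discarded eigen-modes (replaced by the damping term alone) are assumptions for illustration.

```python
import numpy as np

def truncated_eig_randomized(A, k, oversample=10, rng=None):
    """Approximate the top-k eigenpairs of a symmetric PSD matrix A
    using a randomized range finder (Halko-et-al.-style sketch).
    Cost is O(n^2 (k + oversample)) instead of the O(n^3) full eigen-decomposition."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    # Sketch the dominant range of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for the sketched range
    B = Q.T @ A @ Q                           # small (k+p) x (k+p) projected matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]         # keep the k largest modes
    return evals[idx], Q @ evecs[:, idx]

def apply_inverse_lowrank(evals, U, lam, V):
    """Apply (A + lam*I)^{-1} to V given the rank-k approximation A ≈ U diag(evals) U^T.
    On the retained eigen-modes the inverse is 1/(eigenvalue + lam);
    on the (rapidly decaying) discarded modes it is approximated by 1/lam."""
    coeff = U.T @ V
    return U @ (coeff / (evals + lam)[:, None]) + (V - U @ coeff) / lam
```

Under the assumption that only a constant number of eigen-modes carry non-negligible mass, this replaces the cubic-cost inversion of each Kronecker factor with a quadratic-cost sketch-and-invert step, which is the complexity reduction claimed above.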