Efficiently approximating local curvature information of the loss function is a key tool for the optimization and compression of deep neural networks. Yet, most existing methods for approximating second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) in the case where the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and computes the IHVP in dimension $d$, when the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost per IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization at lower computational overhead than existing second-order methods. Implementations are available at [10] and [18].
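To make the stated complexities concrete, the following is a minimal NumPy sketch of the rank-one (Sherman-Morrison) recursion underlying such matrix-free IHVP estimates, assuming the damped empirical Fisher $F = \lambda I + \frac{1}{m}\sum_{i=1}^m g_i g_i^\top$. The function names `precompute` and `ihvp`, the damping parameter `lam`, and the dense-gradient layout are illustrative assumptions, not the released M-FAC API; the actual implementations at [10] and [18] use further optimizations.

```python
import numpy as np

def precompute(G, lam=1e-4):
    """Precompute Sherman-Morrison vectors in O(d m^2).

    G: (m, d) array of gradients g_1..g_m; lam: damping.
    With F_i = lam*I + (1/m) * sum_{j<=i} g_j g_j^T and B_i = F_i^{-1},
    the rank-one update gives B_i = B_{i-1} - v_i v_i^T / c_i,
    where v_i = B_{i-1} @ g_i and c_i = m + g_i^T v_i.
    """
    m, d = G.shape
    V = np.empty_like(G)  # V[i] stores v_{i+1}
    c = np.empty(m)
    for i in range(m):
        g = G[i]
        v = g / lam  # B_0 @ g
        if i > 0:
            # apply the earlier rank-one corrections to get B_{i-1} @ g
            coeffs = (V[:i] @ g) / c[:i]
            v = v - coeffs @ V[:i]
        V[i] = v
        c[i] = m + g @ v
    return V, c, lam

def ihvp(x, V, c, lam):
    """IHVP B_m @ x in O(d m), since by telescoping
    B_m = (1/lam) * I - sum_i v_i v_i^T / c_i."""
    return x / lam - ((V @ x) / c) @ V
```

A quick sanity check against a dense solve (feasible only for small $d$) confirms the recursion:

```python
rng = np.random.default_rng(0)
m, d = 8, 32
G = rng.standard_normal((m, d))
V, c, lam = precompute(G, lam=0.1)
F = lam * np.eye(d) + G.T @ G / m
x = rng.standard_normal(d)
assert np.allclose(ihvp(x, V, c, lam), np.linalg.solve(F, x))
```

Note that `precompute` touches $m$ vectors of length $d$, once per prior gradient, giving the $O(dm^2)$ setup cost quoted above, while each subsequent `ihvp` call is a single $(m \times d)$ matrix-vector product pair, i.e. $O(dm)$.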