Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].
翻译:高效地接近损失函数的本地曲线信息是优化和压缩深层神经网络的关键工具。 然而,大多数现有的近似二阶信息的方法都具有高计算或存储成本,这可能会限制其实用性。 在这项工作中,我们调查了用于估算反赫西西亚矢量产品(IHVP)的无矩阵、线性时间方法,当赫西安人可以被比作一级矩阵的总和时,就像经验化渔业矩阵对赫西亚人的典型快速缩略图一样。我们建议了两个新的算法,作为称为M-FAC的框架的一部分:第一个算法是针对网络压缩的,并且可以按维特价计算IHVP的平价,如果赫西安人以一级矩阵总和美元计算,那么计算IHVP的第二位成本,然后用我们所希望的任何单元素在 Heseria-rorma 的平流成本。