Curvature in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model of the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block-diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues and eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. We demonstrate this by conducting performance benchmarks and substantiate ViViT's usefulness by studying the impact of noise on the GGN's structural properties during neural network training.
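The low-rank structure mentioned above rests on a standard linear-algebra identity: the GGN of a mini-batch with N samples and C model outputs can be written as G = V Vᵀ, where V stacks N·C scaled Jacobian columns, so G's nonzero spectrum coincides with that of the small (NC × NC) Gram matrix VᵀV. Below is a minimal NumPy sketch of this Gram-matrix trick on random data, illustrating why eigenvalues and eigenvectors become cheap; the sizes N, C, D and all variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sizes: N samples, C outputs, D parameters (D >> N*C).
N, C, D = 8, 3, 1000
rng = np.random.default_rng(0)

# V stacks the N*C scaled Jacobian columns; the GGN is G = V @ V.T
# and therefore has rank at most N*C, far below its dimension D.
V = rng.normal(size=(D, N * C)) / np.sqrt(N)

# The Gram matrix V.T @ V is only (NC x NC); its eigenvalues are
# exactly the nonzero eigenvalues of the (D x D) matrix G.
gram = V.T @ V
evals, evecs_gram = np.linalg.eigh(gram)

# Recover G's eigenvectors: if gram @ u = lam * u, then V @ u is an
# (unnormalized) eigenvector of G with the same eigenvalue, since
# G @ (V @ u) = V @ (gram @ u) = lam * (V @ u); its norm is sqrt(lam).
keep = evals > 1e-10
evecs = V @ evecs_gram[:, keep] / np.sqrt(evals[keep])

# Verify against the explicit D x D GGN (feasible only at toy scale).
G = V @ V.T
evals_direct = np.linalg.eigvalsh(G)[-keep.sum():]
assert np.allclose(np.sort(evals[keep]), np.sort(evals_direct))
assert np.allclose(G @ evecs, evecs * evals[keep])
```

Under this sketch's assumptions, the cost of the eigendecomposition is governed by the mini-batch dimensions (NC) rather than the parameter count D, which is the mechanism behind the cost-accuracy trade-off claimed in the abstract.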