Curvature in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model of the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block-diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues and eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. As examples of ViViT's usefulness, we investigate the directional gradients and curvatures during training, and how noise information can be used to improve the stability of second-order methods.
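The key computational idea behind the low-rank structure can be illustrated with a toy sketch (this is not the ViViT implementation; it assumes a square loss, so the GGN factor reduces to the scaled per-sample Jacobian, and the matrices here are dense only for checking): the GGN factors as G = V Vᵀ with a tall-and-skinny V, so its nonzero eigenvalues can be obtained from the small Gram matrix Vᵀ V without ever forming G.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 20  # N mini-batch samples, D parameters; N << D, so rank(G) <= N

# Hypothetical per-sample Jacobians of the model output w.r.t. the parameters.
# For square loss the GGN is G = (1/N) * J^T J, i.e. G = V V^T with V = J^T / sqrt(N).
J = rng.standard_normal((N, D))
V = J.T / np.sqrt(N)  # D x N low-rank factor

# Full D x D GGN -- formed here only to verify; the low-rank approach never builds it.
G = V @ V.T

# N x N Gram matrix: its eigenvalues are exactly the nonzero eigenvalues of G.
gram = V.T @ V
evals_gram = np.sort(np.linalg.eigvalsh(gram))[::-1]
evals_full = np.sort(np.linalg.eigvalsh(G))[::-1][:N]

print(np.allclose(evals_gram, evals_full))  # True: top-N spectra agree
```

The cost of the eigendecomposition thus scales with the Gram matrix size (mini-batch size times number of outputs) rather than with the number of parameters, which is what makes the exact low-rank representation tractable for large networks.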