Interpretability methods for deep neural networks mainly focus on the sensitivity of the class score with respect to the original or perturbed input, usually measured using actual or modified gradients. Some methods also take a model-agnostic approach to understanding the rationale behind every prediction. In this paper, we argue and demonstrate that the local geometry of the model parameter space relative to the input can also be beneficial for improved post-hoc explanations. To achieve this goal, we introduce an interpretability method called "geometrically-guided integrated gradients" that builds on the gradient calculation along a linear path, as traditionally used in integrated-gradient methods. However, instead of integrating gradient information, our method explores the model's dynamic behavior across multiple scaled versions of the input and captures the best possible attribution for each input. We demonstrate through extensive experiments that the proposed approach outperforms vanilla and integrated gradients in both subjective and quantitative assessments. We also propose a "model perturbation" sanity check to complement the traditionally used "model randomization" test.
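To make the contrast with standard integrated gradients concrete, the following is a minimal PyTorch sketch: it evaluates class-score gradients at scaled inputs along the usual linear path from a baseline to the input and then, instead of averaging them as integrated gradients does, keeps an element-wise "best" gradient per input dimension. The toy model, the zero baseline, and the max-magnitude selection rule are illustrative assumptions, not the exact procedure of the proposed method.

```python
# Hedged sketch contrasting integrated gradients (path-averaged gradients)
# with a path-wise "best gradient" selection. The selection criterion here
# (largest-magnitude gradient per dimension) is an assumption for illustration.
import torch


def path_gradients(model, x, baseline, target, steps=50):
    """Gradients of the target class score at scaled inputs along a linear path."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    scaled = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    scores = model(scaled)[:, target].sum()
    grads, = torch.autograd.grad(scores, scaled)
    return grads  # shape: (steps, *x.shape)


def integrated_gradients(model, x, baseline, target, steps=50):
    # Riemann approximation of IG: average path gradients, scale by input delta.
    grads = path_gradients(model, x, baseline, target, steps)
    return (x - baseline) * grads.mean(dim=0)


def best_gradient_attribution(model, x, baseline, target, steps=50):
    # Instead of averaging, keep the largest-magnitude gradient observed along
    # the path for every input dimension (an assumed stand-in for "best").
    grads = path_gradients(model, x, baseline, target, steps)
    idx = grads.abs().argmax(dim=0, keepdim=True)
    return (x - baseline) * grads.gather(0, idx).squeeze(0)


if __name__ == "__main__":
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3)
    )
    x, baseline = torch.randn(4), torch.zeros(4)
    print(integrated_gradients(model, x, baseline, target=0))
    print(best_gradient_attribution(model, x, baseline, target=0))
```

Both functions reuse the same scaled-input gradients; only the aggregation step differs, which is the design axis the abstract highlights.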