Neural network interpretation methods, particularly feature attribution methods, are known to be fragile with respect to adversarial input perturbations. To address this, several methods that enhance the local smoothness of the gradient during training have been proposed for attaining \textit{robust} feature attributions. However, prior work has not accounted for the normalization of the attributions, which is essential to their visualization, and this has been an obstacle to understanding and improving the robustness of feature attribution methods. In this paper, we provide new insights by taking such normalization into account. First, we show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is \textit{not} normalization invariant, which means that two functions with the same normalized gradient can have different values. Second, we formulate a normalization-invariant cosine distance-based criterion and derive its upper bound, which gives insight into why simply minimizing the Hessian norm at the input, as has been done in previous work, is not sufficient for attaining robust feature attribution. Finally, we propose to combine both the $\ell_2$ and cosine distance-based criteria as regularization terms to leverage the advantages of both in aligning the local gradient. As a result, we experimentally show that models trained with our method produce much more robust interpretations on CIFAR-10 and ImageNet-100 without significantly hurting accuracy, compared to recent baselines. To the best of our knowledge, this is the first work to verify the robustness of interpretation on a larger-scale dataset beyond CIFAR-10, thanks to the computational efficiency of our method.
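The contrast between the two criteria above can be made concrete with a minimal sketch. The function names and weighting parameters below are hypothetical illustrations, not the paper's implementation: an $\ell_2$ distance between a gradient and its perturbed counterpart changes under rescaling, while a cosine distance depends only on direction and is therefore normalization invariant.

```python
import numpy as np

def l2_distance(g, g_adv):
    # Naive l2-robust criterion: sensitive to the scale of the gradients,
    # so it is NOT invariant under normalization.
    return np.linalg.norm(g - g_adv)

def cosine_distance(g, g_adv):
    # Normalization-invariant criterion: depends only on gradient direction.
    return 1.0 - (g @ g_adv) / (np.linalg.norm(g) * np.linalg.norm(g_adv))

def combined_criterion(g, g_adv, lam_l2=1.0, lam_cos=1.0):
    # Hypothetical combined regularizer weighting both criteria,
    # in the spirit of the combination described in the abstract.
    return lam_l2 * l2_distance(g, g_adv) + lam_cos * cosine_distance(g, g_adv)

# Rescaling one gradient changes the l2 term but leaves the cosine term
# at zero, illustrating which criterion is normalization invariant.
g = np.array([1.0, 0.0])
print(cosine_distance(g, 2.0 * g))  # 0.0 (up to floating-point error)
print(l2_distance(g, 2.0 * g))      # 1.0
```

In training, `g` and `g_adv` would be input gradients of the model at a clean input and at an adversarially perturbed input, respectively, and the combined term would be added to the task loss.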