Model attributions are important for deep neural networks, as they help practitioners understand model behavior, but recent studies show that attributions can be easily perturbed by adding imperceptible noise to the input. The non-differentiable Kendall's rank correlation is a key performance index for attribution protection. In this paper, we first show that the expected Kendall's rank correlation is positively correlated with cosine similarity, and then argue that the direction of the attribution vector is the key to attribution robustness. Based on these findings, we explore the vector space of attributions to explain the shortcomings of attribution defense methods based on $\ell_p$ norms and propose the integrated gradient regularizer (IGR), which maximizes the cosine similarity between natural and perturbed attributions. Our analysis further shows that IGR encourages neurons to keep the same activation states for natural samples and their corresponding perturbed samples, which in turn induces robustness for gradient-based attribution methods. Our experiments on different models and datasets confirm our analysis of attribution protection and demonstrate a decent improvement in adversarial robustness.
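To make the regularizer concrete, the sketch below illustrates one way the cosine-similarity objective between natural and perturbed attributions could be implemented in PyTorch. It is a minimal sketch under stated assumptions, not the paper's exact formulation: the helper names `integrated_gradients` and `igr_regularizer`, the zero baseline, the number of interpolation steps, and the way the term would be added to the classification loss are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def integrated_gradients(model, x, target, baseline=None, steps=32):
    """Approximate integrated gradients of the target-class logit along the
    straight-line path from a baseline (zeros by default, an assumption) to x."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grads = torch.zeros_like(x)
    for i in range(1, steps + 1):
        alpha = i / steps
        # Interpolated point; detach so it is a leaf we can differentiate w.r.t.
        interp = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        logits = model(interp)
        score = logits.gather(1, target.unsqueeze(1)).sum()
        # create_graph=True keeps the graph so the regularizer can be backpropagated
        # through to the model parameters during training.
        total_grads = total_grads + torch.autograd.grad(score, interp, create_graph=True)[0]
    return (x - baseline) * total_grads / steps

def igr_regularizer(model, x_nat, x_adv, target):
    """Negative mean cosine similarity between natural and perturbed attributions;
    adding this term to the classification loss encourages attribution alignment."""
    attr_nat = integrated_gradients(model, x_nat, target)
    attr_adv = integrated_gradients(model, x_adv, target)
    cos = F.cosine_similarity(attr_nat.flatten(1), attr_adv.flatten(1), dim=1)
    return -cos.mean()  # maximizing cosine similarity = minimizing its negative
```

In a training loop one would plausibly minimize `cross_entropy(model(x_nat), target) + lam * igr_regularizer(model, x_nat, x_adv, target)`, where `x_adv` is a perturbed copy of `x_nat` and `lam` is a hypothetical trade-off weight.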