Gradient-based explanations are a cornerstone of explainable deep networks, yet they have been shown to be vulnerable to adversarial attacks. Existing work measures explanation robustness with $\ell_p$-norms, which can be counter-intuitive to humans, who attend only to the top few salient features. We propose explanation ranking thickness as a more suitable metric of explanation robustness. We then present a new, practical adversarial attack goal: manipulating explanation rankings. Because computing the thickness involves expensive sampling and integration, we derive surrogate bounds that mitigate ranking-based attacks while remaining computationally feasible. Using a multi-objective analysis of the convergence of a gradient-based attack, we confirm that explanation robustness can be measured by the thickness metric. Experiments on various network architectures and diverse datasets demonstrate the superiority of the proposed methods and show that the widely adopted Hessian-based curvature-smoothing approaches are less robust than ours.
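To make the ranking-thickness idea concrete, the following is a minimal illustrative sketch, not the paper's actual method. It assumes thickness is estimated by Monte-Carlo sampling along the segment between a clean input and a perturbed one, counting how often the top-$k$ set of gradient-based saliency scores stays unchanged; the toy quadratic model, `k`, and the sampling scheme are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3

# Toy differentiable model f(x) = x^T A x; its gradient (A + A^T) x
# serves as the saliency map. A, x, and x_adv are illustrative placeholders.
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)
x_adv = x + 0.5 * rng.standard_normal(d)

def saliency(z):
    """Magnitude of the toy model's gradient at z."""
    return np.abs((A + A.T) @ z)

def topk(z):
    """Indices of the k most salient features at z."""
    return set(np.argsort(saliency(z))[-k:])

# Monte-Carlo estimate of top-k ranking thickness: the fraction of
# points sampled on the segment [x, x_adv] whose top-k saliency set
# matches that of the clean input x.
ref = topk(x)
ts = rng.uniform(size=1000)
thickness = np.mean([topk((1 - t) * x + t * x_adv) == ref for t in ts])
print(f"estimated top-k thickness: {thickness:.2f}")
```

A thickness near 1 means the top-$k$ explanation ranking is stable along the whole path to the perturbed input; values near 0 indicate the ranking flips almost immediately, which is exactly what a ranking-based attack aims to induce.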