Saliency methods can make deep neural network predictions more interpretable by identifying a set of critical features in an input sample, such as the pixels that contribute most strongly to a prediction made by an image classifier. Unfortunately, recent evidence suggests that many saliency methods perform poorly, especially in situations where gradients are saturated, inputs contain adversarial perturbations, or predictions rely upon inter-feature dependence. To address these issues, we propose a framework that improves the robustness of saliency methods via a two-step procedure. First, we introduce a perturbation mechanism that subtly varies the input sample without changing its intermediate representations. Using this approach, we can gather a corpus of perturbed data samples while ensuring that the perturbed and original input samples follow the same distribution. Second, we compute saliency maps for the perturbed samples and propose a new method to aggregate them. With this design, we offset the influence of gradient saturation on interpretation. From a theoretical perspective, we show that the aggregated saliency map not only captures inter-feature dependence but, more importantly, robustifies interpretation against previously described adversarial perturbation methods. Following our theoretical analysis, we present experimental results suggesting that, both qualitatively and quantitatively, our saliency method outperforms existing methods.
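To make the two-step procedure concrete, the sketch below shows one minimal instantiation in PyTorch. It is an illustration under stated assumptions, not the paper's method: the function name `aggregated_saliency` is hypothetical, plain Gaussian noise stands in for the paper's representation-preserving perturbation mechanism, vanilla gradients stand in for the per-sample saliency maps, and simple averaging (essentially SmoothGrad-style aggregation) stands in for the proposed aggregation rule.

```python
import torch

def aggregated_saliency(model, x, num_samples=32, sigma=0.01):
    """Illustrative sketch of the two-step procedure (names and
    choices here are assumptions, not the paper's actual design)."""
    model.eval()
    # Fix the class under explanation: the model's prediction on the
    # original, unperturbed input.
    with torch.no_grad():
        target = model(x.unsqueeze(0)).argmax(dim=1).item()

    maps = []
    for _ in range(num_samples):
        # Step 1: perturb the input slightly. The paper's mechanism
        # keeps intermediate representations unchanged; Gaussian noise
        # is used here only as a placeholder.
        x_pert = (x + sigma * torch.randn_like(x)).requires_grad_(True)

        # Step 2a: compute a vanilla-gradient saliency map for the
        # perturbed sample with respect to the predicted class.
        score = model(x_pert.unsqueeze(0))[0, target]
        score.backward()
        maps.append(x_pert.grad.detach().abs())

    # Step 2b: aggregate the per-sample maps. Averaging is one simple
    # rule, standing in for the paper's proposed aggregation method.
    return torch.stack(maps).mean(dim=0)
```

Averaging over perturbed copies is what counteracts gradient saturation in this sketch: even where the gradient at the original input is near zero, gradients at nearby perturbed inputs need not be, so the aggregate recovers signal that a single-sample map would miss.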