Understanding the decision process of neural networks is hard. One vital approach to explanation is to attribute the model's decision to pivotal features. Although many attribution algorithms have been proposed, most of them focus solely on improving faithfulness to the model. However, real environments contain random noise, which can cause large fluctuations in the explanations. More seriously, recent work shows that explanation algorithms are vulnerable to adversarial attacks. All of this makes explanations hard to trust in real scenarios. To bridge this gap, we propose a model-agnostic method, \emph{Median Test for Feature Attribution} (MeTFA), to quantify the uncertainty and increase the stability of explanation algorithms with theoretical guarantees. MeTFA has two functions: (1) it examines whether a feature is significantly important or unimportant and generates a MeTFA-significant map to visualize the result; (2) it computes the confidence interval of a feature attribution score and generates a MeTFA-smoothed map to increase the stability of the explanation. Experiments show that MeTFA improves the visual quality of explanations and significantly reduces instability while maintaining faithfulness. To quantitatively evaluate the faithfulness of an explanation under different noise settings, we further propose several robust faithfulness metrics. Experimental results show that MeTFA-smoothed explanations significantly increase robust faithfulness. In addition, we use two scenarios to show MeTFA's potential in applications. First, when applied to the SOTA explanation method for locating context bias in semantic segmentation models, MeTFA-significant explanations use far smaller regions to maintain 99\%+ faithfulness. Second, when tested against different explanation-oriented attacks, MeTFA helps defend against both vanilla and adaptive adversarial attacks on explanations.
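To make the abstract's two functions concrete, the following is a minimal illustrative sketch (not the paper's exact procedure): attribution maps are computed for several independently perturbed copies of the input, a distribution-free confidence interval for the per-feature median is obtained from binomial order statistics, and features whose interval lies entirely above or below a reference level are marked significant. The attribution function, Gaussian noise model, sample count, and reference threshold \texttt{ref} are placeholders assumed for illustration.
\begin{verbatim}
# Hypothetical sketch of a median-test-style attribution pipeline.
import numpy as np
from scipy.stats import binom

def metfa_sketch(x, attribute, n_samples=20, sigma=0.1, alpha=0.05, ref=0.0):
    # Attribution maps for independently perturbed copies of the input.
    maps = np.stack([
        attribute(x + np.random.normal(0.0, sigma, size=x.shape))
        for _ in range(n_samples)
    ])                                      # shape: (n_samples, *x.shape)
    sorted_maps = np.sort(maps, axis=0)

    # Sign-test confidence interval for the median: take the smallest j with
    # P(Binom(n, 0.5) <= j) >= alpha/2, then use the j-th and (n+1-j)-th
    # order statistics (1-indexed) as the interval endpoints.
    j = max(int(binom.ppf(alpha / 2, n_samples, 0.5)), 1)
    lower = sorted_maps[j - 1]              # per-feature lower bound
    upper = sorted_maps[n_samples - j]      # per-feature upper bound

    smoothed = np.median(maps, axis=0)      # "smoothed" attribution map
    significant_pos = lower > ref           # significantly important features
    significant_neg = upper < ref           # significantly unimportant features
    return smoothed, (lower, upper), significant_pos, significant_neg
\end{verbatim}
This sketch only conveys the flavor of a median test over noisy attributions; the choice of noise, the reference level, and the exact significance procedure are assumptions for exposition.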