Do Input Gradients Highlight Discriminative Features?

Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): the magnitude of input gradients (gradients of logits with respect to the input) noisily highlights discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we introduce BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating assumption (A). Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [Leavitt and Morcos, 2020]. We believe that the DiffROAR evaluation framework and BlockMNIST-based datasets can serve as sanity checks to audit instance-specific interpretability methods; code and data are available at https://github.com/harshays/inputgradients.
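
To make the quantity in assumption (A) concrete, the following is a minimal sketch of an input-gradient attribution for a one-hidden-layer ReLU MLP, the model class named in the theoretical result. It is not taken from the linked repository: the function names (logit, input_gradient, top_k_coordinates), the weights, and the input are all hypothetical values chosen for illustration, and the gradient is written out by hand rather than computed with an autodiff library.

```python
# Sketch only: input-gradient attribution for a one-hidden-layer ReLU MLP.
# All weights and the input below are made-up illustrative values.

def logit(x, W, b, v):
    """Scalar logit f(x) = sum_j v[j] * relu(W[j] . x + b[j])."""
    total = 0.0
    for w_j, b_j, v_j in zip(W, b, v):
        pre = sum(w * xi for w, xi in zip(w_j, x)) + b_j
        total += v_j * max(pre, 0.0)
    return total

def input_gradient(x, W, b, v):
    """Gradient of the logit w.r.t. x: sum of v[j] * W[j] over active units."""
    grad = [0.0] * len(x)
    for w_j, b_j, v_j in zip(W, b, v):
        pre = sum(w * xi for w, xi in zip(w_j, x)) + b_j
        if pre > 0.0:  # ReLU passes gradient only when the unit is active
            for i, w in enumerate(w_j):
                grad[i] += v_j * w
    return grad

def top_k_coordinates(grad, k):
    """Indices of the k input coordinates with largest gradient magnitude."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return order[:k]

# Hypothetical 3-dimensional input and 2 hidden units.
W = [[1.0, 0.0, -2.0], [0.5, 3.0, 0.0]]
b = [0.0, -1.0]
v = [1.0, -1.0]
x = [1.0, 1.0, 0.0]

grad = input_gradient(x, W, b, v)
ranking = top_k_coordinates(grad, k=2)
```

Under assumption (A), the coordinates in ranking would be expected to coincide with the discriminative features of x. The abstract's claim is that, for standard models, this expectation can fail.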
