When do gradient-based explanation algorithms provide meaningful explanations? We propose a necessary criterion: their feature attributions need to be aligned with the tangent space of the data manifold. To provide evidence for this hypothesis, we introduce a framework based on variational autoencoders that allows us to estimate and generate image manifolds. Through experiments across a range of different datasets -- MNIST, EMNIST, CIFAR10, X-ray pneumonia and Diabetic Retinopathy detection -- we demonstrate that the more a feature attribution is aligned with the tangent space of the data, the more structured and explanatory it tends to be. In particular, the attributions provided by popular post-hoc methods such as Integrated Gradients, SmoothGrad and Input $\times$ Gradient tend to be more strongly aligned with the data manifold than the raw gradient. As a consequence, we suggest that explanation algorithms should actively strive to align their explanations with the data manifold. In part, this can be achieved by adversarial training, which leads to better alignment across all datasets. Some form of adjustment to the model architecture or training algorithm is necessary, since we show that generalization of neural networks alone does not imply the alignment of model gradients with the data manifold.
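To illustrate the criterion, the following is a minimal sketch (not the authors' code) of how the alignment between a feature attribution and the tangent space of an estimated data manifold could be computed. It assumes a hypothetical trained VAE decoder `decoder(z)` that maps a latent vector to an image; the tangent space at a generated point is taken to be the column span of the decoder's Jacobian, and alignment is the fraction of the attribution's norm that lies in that subspace.

```python
import torch


def tangent_space_alignment(decoder, z, attribution):
    """Fraction of the attribution's norm lying in the tangent space of the
    decoded manifold at x = decoder(z). Values near 1 indicate strong
    alignment, values near 0 indicate the attribution is mostly orthogonal
    to the manifold. `decoder` and `attribution` are illustrative placeholders."""

    def flat_decoder(latent):
        # Flatten the decoded image so the Jacobian is a 2-D matrix.
        return decoder(latent).reshape(-1)

    # Jacobian of the decoder at z; its columns span the tangent space.
    J = torch.autograd.functional.jacobian(flat_decoder, z)  # shape (d_x, d_z)

    # Orthonormal basis of the tangent space via a reduced QR decomposition.
    Q, _ = torch.linalg.qr(J)

    a = attribution.reshape(-1)
    a_tangent = Q @ (Q.T @ a)  # orthogonal projection onto span(J)
    return (a_tangent.norm() / a.norm()).item()
```

Under this sketch, comparing the score of, say, an Integrated Gradients attribution with that of the raw gradient at the same point would quantify which explanation is more strongly aligned with the estimated manifold.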