Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_{\theta} ( y \mid x)$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax without changing the discriminative function. This leaves an open question: if input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? We investigate this by re-interpreting the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution and show that input-gradients can be viewed as gradients of a class-conditional density model $p_{\theta}(x \mid y)$ implicit within the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model $p_{\theta}(x \mid y)$ with that of the ground truth data distribution $p_{\text{data}} (x \mid y)$. We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this alignment we use score-matching, and propose novel approximations to this algorithm to enable training large-scale models. Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power while reducing this alignment has the opposite effect. Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models.
翻译:具有歧视性的深神经网络的当前可解释性方法通常依赖于模型的输入梯度, 也就是输出梯度的梯度 。 常见的假设是, 这些输入梯度包含关于 $p ⁇ theta} ( y\ mid x) 美元的信息, 模型的偏向性能力, 从而证明它们用于解释性。 但是, 在这项工作中, 这些输入梯度可以任意地被操纵, 因为它会改变软体的变向性调整, 而不会改变分析性功能 。 这就留下了一个尚未解决的问题 : 如果输入梯度的梯度可能是任意的, 为什么它们结构化和解释性在标准模型中是高度结构化的? 我们通过重新解释基于标准软体格的校正性校正的校正性, 并表明, 其输入梯度可以被视为一个等级- 密度模型的梯度 $p ⁇ - lix 。 在这种分析性模型中, 我们的变正值 将这种变正值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值 值