There is broad agreement in the literature that explanation methods should be faithful to the model they explain, but faithfulness remains a rather vague term. We revisit faithfulness in the context of continuous data and propose two formal definitions of faithfulness for feature attribution methods. Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of a feature on the model, and quantitative faithfulness demands that the magnitude of scores reflects the true quantitative effect. We discuss under which conditions, and to what extent (local vs. global), these requirements can be satisfied. As an application of the conceptual idea, we look at differentiable classifiers over continuous data and characterize gradient scores as follows: every qualitatively faithful feature attribution method is qualitatively equivalent to gradient scores. Furthermore, if an attribution method is quantitatively faithful in the sense that changes in the output of the classifier are proportional to the scores of features, then it is either equivalent to gradient scoring or based on an inferior approximation of the classifier. To illustrate the practical relevance of the theory, we experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous and the classifier differentiable.
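As a rough illustration of the two notions for a differentiable classifier, the sketch below computes gradient scores for a small logistic model and compares them with the actual change in output under a tiny perturbation of each feature. The model, its weights, the input point, and all function names are hypothetical choices made for this example and are not taken from the paper's experiments. Because the check uses a small step, it probes only local behavior.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classifier(x, w, b):
    """Hypothetical differentiable classifier: logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient_scores(x, w, b):
    """Partial derivative of the classifier output w.r.t. each feature."""
    p = classifier(x, w, b)
    # d/dx_i sigmoid(w.x + b) = p * (1 - p) * w_i
    return [p * (1.0 - p) * wi for wi in w]


def output_change(x, w, b, i, eps):
    """Actual change in output when feature i is increased by eps."""
    shifted = list(x)
    shifted[i] += eps
    return classifier(shifted, w, b) - classifier(x, w, b)


# Assumed toy parameters and input (illustrative only).
w = [2.0, -1.0, 0.5]
b = 0.1
x = [0.3, 0.8, -0.4]
eps = 1e-4

scores = gradient_scores(x, w, b)
changes = [output_change(x, w, b, i, eps) for i in range(len(x))]

# Qualitative check: the sign of each score matches the direction
# in which the output moves when that feature is increased.
sign_match = all((s > 0) == (c > 0) for s, c in zip(scores, changes))

# Quantitative check: the output change is approximately score * eps,
# i.e. proportional to the score for a fixed small step.
max_rel_err = max(
    abs(c - s * eps) / abs(s * eps) for s, c in zip(scores, changes)
)

print("scores:", scores)
print("signs agree:", sign_match)
print("max relative error of first-order prediction:", max_rel_err)
```

An attribution method whose scores had a different sign pattern at this point would fail the first check, and one whose magnitudes were not proportional to the observed changes would fail the second.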