Attribution methods have been developed to understand the decision-making process of machine learning models, especially deep neural networks, by assigning importance scores to individual input features. Existing attribution methods are often built upon empirical intuitions and heuristics; a unified framework that provides deeper understanding of their rationales, theoretical fidelity, and limitations is still lacking. To bridge this gap, we present a Taylor attribution framework that theoretically characterizes the fidelity of explanations. The key idea is to decompose model outputs into first-order, high-order independent, and high-order interactive terms, which clarifies the attribution of high-order effects and complex feature interactions. We propose three desired properties for Taylor attributions: low model approximation error, accurate assignment of independent effects, and accurate assignment of interactive effects. Moreover, several popular attribution methods are mathematically reformulated within the unified Taylor attribution framework. Our theoretical investigation indicates that these attribution methods implicitly capture high-order terms involving complex feature interdependencies, and that among these methods, Integrated Gradient is the only one satisfying all three desired properties. Building on Integrated Gradient, we propose new attribution methods derived from the Taylor framework. Experimental results show that the proposed methods outperform existing ones in model interpretation.
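To make the decomposition concrete, the following is a minimal numerical sketch, not the paper's exact formulation: it expands a scalar model around a baseline to second order via finite differences and splits each feature's attribution into a first-order term, a high-order independent term (diagonal Hessian), and a high-order interactive term (off-diagonal Hessian). The function name, the baseline choice, and the finite-difference scheme are all illustrative assumptions.

```python
import numpy as np

def taylor_attribution(f, x, baseline, eps=1e-4):
    """Second-order Taylor decomposition of f(x) - f(baseline).

    Returns per-feature (first_order, independent, interactive) terms,
    whose sum approximates f(x) - f(baseline) (exactly, for quadratic f,
    up to finite-difference error). Illustrative sketch only.
    """
    d = x - baseline
    n = x.size
    grad = np.zeros(n)
    hess = np.zeros((n, n))
    # Central finite differences for the gradient and Hessian at the baseline.
    for i in range(n):
        e_i = np.zeros(n); e_i[i] = eps
        grad[i] = (f(baseline + e_i) - f(baseline - e_i)) / (2 * eps)
        for j in range(n):
            e_j = np.zeros(n); e_j[j] = eps
            hess[i, j] = (f(baseline + e_i + e_j) - f(baseline + e_i - e_j)
                          - f(baseline - e_i + e_j) + f(baseline - e_i - e_j)) / (4 * eps**2)
    first_order = grad * d                          # first-order term per feature
    independent = 0.5 * np.diag(hess) * d**2        # high-order independent term
    # Each feature's half-share of its pairwise interactions (off-diagonal Hessian).
    interactive = 0.5 * (hess * np.outer(d, d)).sum(axis=1) - independent
    return first_order, independent, interactive
```

For a quadratic model the three terms recover the output change exactly, and the split makes visible how much of a feature's score comes from interactions rather than its independent effect.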