Attribution methods have been developed to explain the decision-making process of machine learning models, especially deep neural networks, by assigning importance scores to individual features. Existing attribution methods are often built on empirical intuitions and heuristics. A general theoretical framework is still lacking that can not only unify these attribution methods but also reveal their rationales, fidelity, and limitations. To bridge this gap, we propose a Taylor attribution framework and reformulate seven mainstream attribution methods within it. Based on these reformulations, we analyze the attribution methods in terms of rationale, fidelity, and limitation. Moreover, we establish three principles for a good attribution in the Taylor attribution framework: low approximation error, correct contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and, by benchmarking on real-world datasets, reveal a positive correlation between attribution performance and the number of principles an attribution method follows.
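To make the idea of a Taylor-based attribution and its approximation error concrete, the following is a minimal sketch and not the paper's formulation. It assumes a generic first-order Taylor expansion around a baseline, f(x) - f(b) ≈ Σ_i ∂f/∂x_i(b) · (x_i - b_i), and attributes each term of the sum to feature i. The toy model `f`, its hand-written gradient `grad_f`, and the all-zero baseline are hypothetical choices made only for illustration.

```python
def f(x):
    # Hypothetical toy model: one linear term plus one interaction term.
    return 2.0 * x[0] + x[0] * x[1]


def grad_f(x):
    # Analytic gradient of the toy model above.
    return [2.0 + x[1], x[0]]


def first_order_taylor_attribution(x, baseline):
    # Attribute to feature i the first-order term grad_i(baseline) * (x_i - baseline_i).
    g = grad_f(baseline)
    return [g_i * (x_i - b_i) for g_i, x_i, b_i in zip(g, x, baseline)]


x = [1.0, 3.0]
baseline = [0.0, 0.0]  # assumed baseline; other choices give other attributions

attributions = first_order_taylor_attribution(x, baseline)
output_change = f(x) - f(baseline)

# Part of the output change that the first-order terms fail to explain.
approximation_error = output_change - sum(attributions)

print(attributions)         # [2.0, 0.0]
print(approximation_error)  # 3.0
```

In this toy case the first-order terms account for only 2.0 of the total output change of 5.0. The remaining 3.0 comes from the interaction term `x[0] * x[1]`, which a first-order expansion at this baseline drops, so neither feature receives credit for it.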