Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
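The core idea of simulating feature removal can be illustrated with a minimal sketch. This is not any specific method from the paper; it is a hypothetical example assuming one simple choice along each of the three dimensions: features are "removed" by replacement with a baseline value, the behavior explained is the model's raw prediction, and each feature's influence is summarized as the output drop when that feature alone is removed.

```python
import numpy as np

def removal_influence(model, x, baseline):
    """Score each feature by how much the model's output changes when
    that feature alone is replaced by a baseline value.

    This instantiates the framework's three choices in their simplest
    form: removal = baseline replacement, behavior = raw prediction,
    summary = single-feature removal effect.
    """
    full = model(x)
    scores = np.empty(len(x))
    for i in range(len(x)):
        x_removed = x.copy()
        x_removed[i] = baseline[i]           # "remove" feature i
        scores[i] = full - model(x_removed)  # influence = drop in output
    return scores

# Toy linear model: f(x) = 3*x0 + 1*x1 + 0*x2
model = lambda x: 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
print(removal_influence(model, x, baseline))  # [3. 1. 0.]
```

For this linear model the scores recover each feature's contribution exactly; for nonlinear models, different choices along the three dimensions (e.g., marginalizing over a data distribution instead of a fixed baseline, or averaging over feature subsets as in SHAP) generally yield different attributions, which is precisely the variation the framework characterizes.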