Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We examine the literature and find that many methods are based on a shared principle of explaining by removing: essentially, measuring the impact of removing sets of features from a model. These methods vary in several respects, so we develop a framework for removal-based explanations that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 26 existing methods, including several of the most widely used approaches (SHAP, LIME, Meaningful Perturbations, permutation tests). Exposing the fundamental similarities between these methods empowers users to reason about which tools to use, and suggests promising directions for ongoing model explainability research.
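To make the shared principle concrete, the following is a minimal sketch (not taken from the paper) of one of the named methods, a permutation test for feature importance. In the framework's terms, it removes a feature by shuffling its column (marginalizing over the feature's observed distribution), explains the model's dataset-level predictive performance, and summarizes each feature's influence as the mean score drop. The `model`, `metric`, and demo setup below are illustrative assumptions, not part of the abstract.

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, rng=None):
    """Removal-based explanation via permutation: for each feature,
    shuffle its column and record the drop in a score where higher
    is better (e.g., accuracy). Returns one importance per feature."""
    rng = np.random.default_rng(rng)
    baseline = metric(y, model.predict(X))        # behavior explained: dataset-level score
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # "Remove" feature j by breaking its association with y
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)           # summary: mean score drop
    return importances

if __name__ == "__main__":
    # Small runnable demo on synthetic data (assumes scikit-learn is installed).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    model = LogisticRegression().fit(X, y)
    print(permutation_importance(model, X, y, accuracy_score, rng=0))
```

Other methods unified by the framework differ along the same three axes: SHAP, for instance, removes feature subsets and summarizes influence with Shapley values, while LIME fits a local surrogate to perturbed inputs. Note that scikit-learn ships its own `sklearn.inspection.permutation_importance` for production use; the sketch above only illustrates the removal principle.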