The Local Interpretable Model-agnostic Explanations (LIME) method is one of the most popular approaches for explaining black-box models at a per-example level. Although many variants have been proposed, few provide a simple way to produce high-fidelity explanations that are also stable and intuitive. In this work, we offer a novel perspective by proposing a model-agnostic local explanation method inspired by the invariant risk minimization (IRM) principle, originally proposed for (global) out-of-distribution generalization, to produce high-fidelity explanations that are also stable and unidirectional across nearby examples. Our method is based on a game-theoretic formulation in which we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example being explained, while in other cases it is more careful and chooses a more conservative (feature) attribution, a behavior that can be highly desirable for recourse. Empirically, we show on tabular, image, and text data that the quality of our explanations, with neighborhoods formed using random perturbations, is much better than that of LIME and in some cases even comparable to methods that use realistic neighbors sampled from the data manifold. This is desirable because learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and it can ascertain stable input features for local decisions of a black box without access to side information such as a (partial) causal graph, as required by some recent works.
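To make the core idea concrete, the following is a minimal sketch, not the paper's exact game-theoretic algorithm: random perturbation neighborhoods around the example to explain are treated as IRM "environments", and a single linear surrogate (whose coefficients serve as the attribution) is fit across all of them with the IRMv1 invariance penalty. All names and hyperparameters here (`irm_lime_sketch`, `black_box`, `sigma`, `irm_weight`, etc.) are illustrative assumptions.

```python
# A minimal sketch of the idea (assumed interface, not the paper's exact
# game-theoretic formulation): treat several random perturbation neighborhoods
# of the example x0 as IRM "environments" and fit one linear surrogate whose
# coefficients serve as the attribution, adding the IRMv1 penalty so that
# coefficients that are not simultaneously (near-)optimal in every
# neighborhood are pushed toward zero.
import torch

def irm_lime_sketch(black_box, x0, n_envs=4, n_samples=200, sigma=0.1,
                    irm_weight=10.0, steps=500, lr=0.05):
    """black_box: callable mapping an (n, d) tensor to (n,) scalar outputs
    (an assumption for this sketch). Returns per-feature attributions."""
    d = x0.shape[0]
    theta = torch.zeros(d, requires_grad=True)  # surrogate weights = attribution
    bias = torch.zeros(1, requires_grad=True)

    # Each "environment" is a fresh batch of Gaussian perturbations around x0.
    envs = []
    for _ in range(n_envs):
        Z = x0 + sigma * torch.randn(n_samples, d)
        y = black_box(Z).detach()        # black-box outputs as regression targets
        envs.append((Z - x0, y))         # center features at the example

    opt = torch.optim.Adam([theta, bias], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.zeros(())
        for Zc, y in envs:
            w = torch.ones(1, requires_grad=True)  # dummy scale for IRMv1 penalty
            pred = (Zc @ theta + bias) * w
            risk = ((pred - y) ** 2).mean()
            # IRMv1: squared gradient of the environment risk w.r.t. the scale w.
            grad_w = torch.autograd.grad(risk, w, create_graph=True)[0]
            loss = loss + risk + irm_weight * grad_w.pow(2).sum()
        loss.backward()
        opt.step()
    return theta.detach()  # coefficients explain black_box locally around x0
```

Intuitively, if a feature's effect on the black-box output flips sign between neighborhoods, no single coefficient is simultaneously optimal in all environments, so the invariance penalty drives that coefficient toward zero; this mirrors the tendency described above to eliminate features whose local gradient abruptly changes sign.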