Interpretable machine learning offers insights into which factors drive a particular prediction of a black-box system. Many interpretation methods focus on selecting explanatory input features and follow either an additive or an instance-wise approach. Additive methods exploit local neighborhoods to learn instance-specific explainers sequentially; the process is therefore inefficient and susceptible to poorly-conditioned samples. Instance-wise methods, in contrast, directly optimize local feature distributions within a global training framework and can thus leverage information shared across inputs. However, they can only interpret single-class predictions and are inconsistent across settings because they rely strictly on a pre-defined number of selected features. This work combines the strengths of both approaches and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness while producing more compact and comprehensible explanations. We also demonstrate its capacity to select stable and important features through extensive experiments on various data sets and black-box model architectures.
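For concreteness, the sketch below illustrates one way an instance-wise explainer could be trained jointly over multiple target classes in the spirit described above. It is a minimal sketch, not the paper's method: the names (`SelectorNet`, `concrete_mask`, `train_explainer`), the Gumbel-softmax relaxation, and the sparsity penalty are illustrative assumptions, and the black-box model `f` is assumed to be a differentiable classifier whose parameters stay fixed.

```python
# Minimal sketch (assumptions noted above): an instance-wise explainer that
# learns per-class feature-selection distributions in a single global
# training loop, rather than fitting a separate local explainer per sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectorNet(nn.Module):
    """Maps an input x to per-class feature-importance logits."""
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features * n_classes),
        )
        self.n_features, self.n_classes = n_features, n_classes

    def forward(self, x):
        # returns logits of shape (batch, n_classes, n_features)
        return self.net(x).view(-1, self.n_classes, self.n_features)

def concrete_mask(logits, tau=0.5):
    """Relaxed (differentiable) feature mask via Gumbel noise + sigmoid."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-10) + 1e-10)
    return torch.sigmoid((logits + gumbel) / tau)

def train_explainer(f, selector, loader, epochs=10, lr=1e-3, sparsity=1e-2):
    """Train the selector so that class-specific masked inputs preserve the
    black-box prediction for every class, with an L1 penalty for compactness.
    Assumes f is differentiable; only the selector's parameters are updated."""
    opt = torch.optim.Adam(selector.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                target = F.softmax(f(x), dim=-1)         # black-box outputs
            logits = selector(x)                          # (B, C, D)
            mask = concrete_mask(logits)                  # relaxed selection
            losses = []
            for c in range(selector.n_classes):
                masked = x * mask[:, c, :]                # keep selected features
                pred = F.log_softmax(f(masked), dim=-1)
                # faithfulness: keep the probability of class c close to original
                losses.append(-(target[:, c] * pred[:, c]).mean())
            loss = sum(losses) + sparsity * mask.abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return selector
```

Because the selector is shared across all inputs and classes, a single forward pass at test time yields explanations for every target class, which is the efficiency and consistency advantage contrasted with per-sample additive explainers in the abstract.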