Contrastive Explanations for Model Interpretability

Contrastive explanations clarify why an event occurred in contrast to another. They are inherently more intuitive for humans to both produce and comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and modifying model behavior to be based only on contrastive reasoning. Our method is based on projecting the model representation to a latent space that captures only the features that are useful (to the model) to differentiate two potential decisions. We demonstrate the value of contrastive explanations by analyzing two different scenarios, using both high-level abstract concept attribution and low-level input token/span attribution, on two widely used text classification tasks. Specifically, we produce explanations that answer two questions: for which label, and against which alternative label, is some aspect of the input useful? And which aspects of the input are useful for and against particular decisions? Overall, our findings shed light on the ability of label-contrastive explanations to provide more accurate and finer-grained interpretability of a model's decision.
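The projection idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the classifier ends in a linear layer with logits W @ h; under that assumption, the only direction in representation space that affects the choice between a label y and an alternative y_alt is the difference of their weight rows. The function name `contrastive_projection`, the random weights, and the dimensions below are hypothetical and are not taken from the paper.

```python
import numpy as np

def contrastive_projection(h, W, y, y_alt):
    """Keep only the component of h that separates class y from class y_alt.

    Assumption (not from the source): logits are computed as W @ h, so the
    logit gap between the two classes depends on h only through the
    direction W[y] - W[y_alt].
    """
    u = W[y] - W[y_alt]
    norm = np.linalg.norm(u)
    if norm == 0.0:
        raise ValueError("the two classes share identical weights")
    u = u / norm
    # Orthogonal projection of h onto the one-dimensional contrastive subspace.
    return np.dot(h, u) * u

# Toy setup: 3 classes, 8-dimensional representation (hypothetical values).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
h = rng.normal(size=8)

h_c = contrastive_projection(h, W, y=0, y_alt=1)

# The logit gap between the two contrasted classes is preserved...
gap_full = (W @ h)[0] - (W @ h)[1]
gap_proj = (W @ h_c)[0] - (W @ h_c)[1]

# ...while the discarded part of h carries no information about that gap.
residual_gap = (W[0] - W[1]) @ (h - h_c)
```

In this toy setting, `gap_full` and `gap_proj` agree up to floating-point error and `residual_gap` is numerically zero, which is the sense in which the projected representation disregards non-contrastive information.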
