To understand the black-box characteristics of deep networks, counterfactual explanation has attracted increasing interest: it identifies not only the important features of an input but also how those features should be modified for the input to be classified as a target class. Observing how features vary among classes reveals the patterns that a deep network has learned from its training dataset. However, current approaches modify features to increase the classification probability of the target class irrespective of the internal characteristics of the network. This often leads to unclear explanations that deviate from real-world data distributions. To address this problem, we propose a counterfactual explanation method that exploits the statistics learned from a training dataset. Specifically, we gradually construct an explanation by iterating over masking and composition steps. The masking step selects an important feature of the input data that must change for the input to be classified as the target class. The composition step then optimizes the previously selected feature so that its output score is close to the logit space of the training data classified as the target class. Experimental results show that our method produces human-friendly interpretations on various classification datasets and verify that such interpretations can be achieved with fewer feature modifications.
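As a rough illustration of the masking and composition loop described in the abstract, the following toy sketch substitutes a small linear classifier for the deep network. It is not the authors' implementation. The function names, the gradient-magnitude selection rule, the squared-error objective against the mean target-class logits, and all numeric values are assumptions made for this example only.

```python
# Toy sketch only: a linear classifier stands in for the deep network.
# Every name, rule, and value below is hypothetical and not taken from the paper.

def logits(W, b, x):
    """Class scores z = W x + b of a linear model."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def predict(W, b, x):
    """Index of the highest-scoring class."""
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)

def mean_logits(W, b, samples):
    """Average logits over reference samples (assumed to be target-class training data)."""
    all_z = [logits(W, b, s) for s in samples]
    return [sum(col) / len(all_z) for col in zip(*all_z)]

def grad_wrt_input(W, z, goal):
    """Gradient of ||z - goal||^2 with respect to each input feature."""
    diff = [zk - gk for zk, gk in zip(z, goal)]
    return [2.0 * sum(W[k][j] * diff[k] for k in range(len(W)))
            for j in range(len(W[0]))]

def counterfactual(W, b, x, target, goal, steps=200, lr=0.05):
    """Alternate masking and composition until x is classified as `target`."""
    x, mask = list(x), set()
    while len(mask) < len(x) and predict(W, b, x) != target:
        # Masking step (assumed rule): add the unmasked feature whose
        # gradient has the largest magnitude.
        g = grad_wrt_input(W, logits(W, b, x), goal)
        free = [j for j in range(len(x)) if j not in mask]
        mask.add(max(free, key=lambda j: abs(g[j])))
        # Composition step: gradient descent on the masked features only,
        # pulling the logits toward the target-class statistics.
        for _ in range(steps):
            g = grad_wrt_input(W, logits(W, b, x), goal)
            for j in mask:
                x[j] -= lr * g[j]
    return x, sorted(mask)

# Two classes, three features; the third feature does not affect the scores.
W = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
b = [0.0, 0.0]
target_class_samples = [[0.0, 1.0, 0.0], [0.2, 0.9, 1.0]]
goal = mean_logits(W, b, target_class_samples)

x = [1.0, 0.0, 1.0]                      # originally classified as class 0
x_cf, changed = counterfactual(W, b, x, target=1, goal=goal)
print(changed, predict(W, b, x_cf))      # which features changed, new class
```

In this toy case a single feature is modified before the prediction flips, and the other features keep their original values. The sketch matches only the mean logits of the reference samples; the statistics the method actually uses may be richer than that.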