While recent years have witnessed the emergence of various explainable methods in machine learning, to what degree the explanations really represent the reasoning process behind the model prediction -- namely, the faithfulness of explanation -- is still an open problem. One commonly used way to measure faithfulness is the \textit{erasure-based} criterion. Though conceptually simple, erasure-based criteria can inevitably introduce biases and artifacts. We propose a new methodology to evaluate the faithfulness of explanations from the \textit{counterfactual reasoning} perspective: the model should produce substantially different outputs for the original input and its corresponding counterfactual edited on a faithful feature. Specifically, we introduce two algorithms to find proper counterfactuals in both discrete and continuous scenarios, and then use the acquired counterfactuals to measure faithfulness. Empirical results on several datasets show that, compared with existing metrics, our proposed counterfactual evaluation method achieves the highest correlation with the ground truth.
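To make the criterion concrete, here is a minimal sketch in notation of our own choosing (the paper's exact formalism may differ): let $f$ be the model, $x$ the input, and $i$ the feature that an explanation $e$ ranks as most important. A counterfactual $\tilde{x}^{(i)}$ edits only feature $i$, and the explanation is judged faithful to the extent that the model output changes,
\[
\mathrm{CF}(e, x) \;=\; d\!\left(f(x),\, f\!\left(\tilde{x}^{(i)}\right)\right), \qquad \tilde{x}^{(i)}_j = x_j \ \text{ for all } j \neq i,
\]
where $d(\cdot,\cdot)$ is some divergence between model outputs (for instance, the absolute change in the predicted-class probability). A larger $\mathrm{CF}$ indicates that the feature highlighted by the explanation indeed drives the prediction.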