Counterfactual examples for an input -- perturbations that change specific features but not others -- have been shown to be useful for evaluating the bias of machine learning models, e.g., against specific demographic groups. However, generating counterfactual examples for images is non-trivial because of the underlying causal structure governing the various features of an image. To be meaningful, generated perturbations need to satisfy constraints implied by the causal model. We present a method that incorporates a structural causal model (SCM) into an improved variant of Adversarially Learned Inference (ALI) to generate counterfactuals consistent with the causal relationships between the attributes of an image. Based on the generated counterfactuals, we show how to explain a pre-trained machine learning classifier, evaluate its bias, and mitigate the bias using a counterfactual regularizer. On the Morpho-MNIST dataset, our method generates counterfactuals comparable in quality to prior work on SCM-based counterfactuals (DeepSCM), while on the more complex CelebA dataset it outperforms DeepSCM in generating high-quality valid counterfactuals. Moreover, the generated counterfactuals are indistinguishable from reconstructed images in a human evaluation experiment, and we subsequently use them to evaluate the fairness of a standard classifier trained on CelebA. We show that the classifier is biased with respect to skin and hair color, and demonstrate how counterfactual regularization can remove those biases.
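To make the counterfactual regularizer concrete, below is a minimal sketch (not the authors' code) of one plausible form of such a penalty in PyTorch: the classifier is trained with its usual task loss plus a term that discourages its prediction from changing between an image and its counterfactual, which differs only in a sensitive attribute such as skin or hair color. The names `classifier`, `x`, `y`, `x_cf`, and `lam` are illustrative assumptions; producing `x_cf` itself requires the SCM-based generation step described above.

```python
import torch
import torch.nn.functional as F

def counterfactual_regularized_loss(classifier, x, y, x_cf, lam=1.0):
    """Standard cross-entropy on the original images, plus a penalty that
    pushes the classifier toward identical outputs on each image and its
    counterfactual (same image with only a sensitive attribute changed)."""
    logits = classifier(x)        # predictions on original images
    logits_cf = classifier(x_cf)  # predictions on counterfactual images
    task_loss = F.cross_entropy(logits, y)
    # If the classifier is counterfactually fair w.r.t. the sensitive
    # attribute, these two sets of logits should match.
    cf_penalty = F.mse_loss(logits_cf, logits)
    return task_loss + lam * cf_penalty
```

The weight `lam` trades off task accuracy against counterfactual invariance; with `lam=0` the term vanishes and the biased baseline classifier is recovered.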