Counterfactual explanations and adversarial examples have emerged as critical research areas for addressing the explainability and robustness goals of machine learning (ML). While counterfactual explanations were developed to provide recourse to individuals adversely impacted by algorithmic decisions, adversarial examples were designed to expose the vulnerabilities of ML models. Although prior research has hinted at the commonalities between these frameworks, there has been little to no work on systematically exploring the connections between the literature on counterfactual explanations and that on adversarial examples. In this work, we make one of the first attempts at formalizing these connections. More specifically, we theoretically analyze salient counterfactual explanation and adversarial example generation methods, and highlight the conditions under which they behave similarly. Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods, such as those proposed by Wachter et al. and by Carlini and Wagner (with mean squared error loss), and C-CHVAE and the natural adversarial examples of Zhao et al., are equivalent. We also bound the distance between the counterfactual explanations and adversarial examples generated by the Wachter et al. and DeepFool methods for linear models. Finally, we empirically validate our theoretical findings using extensive experimentation with synthetic and real-world datasets.
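To make the claimed equivalence concrete, here is a sketch of the two objectives in our own notation (not the paper's), assuming a model f, input x, counterfactual or perturbed input x' = x + δ, target output y' = t, and a distance function d:

$$ \min_{x'} \; \lambda \, \big(f(x') - y'\big)^2 + d(x, x') \qquad \text{(Wachter et al.)} $$

$$ \min_{\delta} \; \|\delta\|_p + c \, \big(f(x + \delta) - t\big)^2 \qquad \text{(Carlini and Wagner with MSE loss)} $$

Identifying x' = x + δ, y' = t, λ = c, and d(x, x') = ||x' − x||_p makes the two problems coincide, which is the sense in which these generation methods are equivalent.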