There has been growing interest in model-agnostic methods that can make deep learning models more transparent and explainable to a user. Some researchers have recently argued that for a machine to achieve a certain degree of human-level explainability, it needs to provide explanations that a human can causally understand, a property also known as causability. One specific class of algorithms with the potential to provide causability is counterfactuals. This paper presents an in-depth systematic review of the diverse existing body of literature on counterfactuals and causability for explainable artificial intelligence. We performed an LDA topic modelling analysis under a PRISMA framework to identify the most relevant articles. This analysis resulted in a novel taxonomy that considers the grounding theories of the surveyed algorithms, together with their underlying properties and applications to real-world data. This research suggests that current model-agnostic counterfactual algorithms for explainable AI are not grounded in a causal theoretical formalism and, consequently, cannot promote causability to a human decision-maker. Our findings suggest that the explanations derived from major algorithms in the literature capture spurious correlations rather than cause-and-effect relationships, leading to sub-optimal, erroneous or even biased explanations. This paper also advances the literature with new directions and challenges for promoting causability in model-agnostic approaches for explainable artificial intelligence.
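To make the screening step concrete, the following is a minimal illustrative sketch of fitting an LDA topic model over a small corpus of candidate abstracts with scikit-learn. It is not the authors' actual pipeline: the placeholder corpus, the choice of `n_components`, and the use of `CountVectorizer`/`LatentDirichletAllocation` are assumptions made purely for illustration; in the survey, the corpus would be the bibliographic records retrieved and screened under PRISMA.

```python
# Illustrative sketch only: LDA topic modelling over a toy corpus of abstracts.
# The documents and the number of topics are placeholders, not the survey's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "counterfactual explanations for black-box model predictions",
    "causal inference and structural causal models for explainability",
    "model-agnostic feature attribution methods for deep networks",
]  # placeholder corpus; a real review would use the screened bibliographic records

# Build a document-term matrix of word counts, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(abstracts)

# Fit LDA; n_components=2 is an arbitrary choice for this toy example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Inspect the top words per topic, which can then be used to group related articles.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {top_terms}")
```

In practice, the topic-word distributions produced this way can help cluster the retrieved literature into themes before the manual eligibility assessment prescribed by PRISMA.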