Machine learning plays a role in many deployed decision systems, often in ways that are difficult or impossible for human stakeholders to understand. Explaining, in a human-understandable way, the relationship between the input and output of machine learning models is essential to the development of trustworthy machine-learning-based systems. A burgeoning body of research seeks to define the goals and methods of explainability in machine learning. In this paper, we seek to review and categorize research on counterfactual explanations, a specific class of explanation that describes what would have happened had the input to a model been changed in a particular way. Modern approaches to counterfactual explainability in machine learning draw connections to established legal doctrine in many countries, making them appealing for fielded systems in high-impact areas such as finance and healthcare. Thus, we design a rubric with desirable properties of counterfactual explanation algorithms and comprehensively evaluate all currently proposed algorithms against that rubric. Our rubric provides easy comparison and comprehension of the advantages and disadvantages of different approaches and serves as an introduction to major research themes in this field. We also identify gaps and discuss promising research directions in the space of counterfactual explainability.