Counterfactual examples are one of the most commonly cited methods for explaining the predictions of machine learning models in key areas such as finance and medical diagnosis. Counterfactuals are often discussed under the assumption that the model on which they will be used is static, but in deployment models may be periodically retrained or fine-tuned. This paper studies the consistency of model predictions on counterfactual examples in deep networks under small changes to initial training conditions, such as weight initialization and leave-one-out variations in the data, of the kind that often occur during model deployment. We demonstrate experimentally that counterfactual examples for deep models are often inconsistent across such small changes, and that increasing the cost of the counterfactual, a stability-enhancing mitigation suggested by prior work in the context of simpler models, is not a reliable heuristic in deep networks. Rather, our analysis shows that a model's local Lipschitz continuity around the counterfactual is key to its consistency across related models. To this end, we propose Stable Neighbor Search as a way to generate more consistent counterfactual explanations, and illustrate the effectiveness of this approach on several benchmark datasets.
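To make the notion of consistency concrete, the sketch below (not the paper's code, and not Stable Neighbor Search itself) shows one plausible setup: a generic gradient-based counterfactual search for a single input, followed by a consistency score defined as the fraction of retrained models that still assign the counterfactual its target class. The step size, regularization weight, and model/data objects are illustrative assumptions.

```python
import torch
import torch.nn as nn

def gradient_counterfactual(model, x, target_class, lr=0.01, n_steps=200, dist_weight=0.1):
    """Generic gradient-based search for a nearby input that `model` assigns to
    `target_class`. Hyperparameters here are placeholders, not the paper's settings."""
    cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([cf], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(n_steps):
        opt.zero_grad()
        logits = model(cf.unsqueeze(0))
        # Push toward the target class while penalizing distance from the original input.
        loss = nn.functional.cross_entropy(logits, target) + dist_weight * torch.norm(cf - x)
        loss.backward()
        opt.step()
    return cf.detach()

def consistency(counterfactual, target_class, retrained_models):
    """Fraction of retrained/fine-tuned models that still place the counterfactual
    in the intended target class (a simple consistency measure)."""
    hits = sum(
        int(m(counterfactual.unsqueeze(0)).argmax(dim=1).item() == target_class)
        for m in retrained_models
    )
    return hits / len(retrained_models)
```

Under this framing, a counterfactual whose consistency score is low is exactly the failure mode described above: a nominally valid explanation on the original model that no longer holds once the model is retrained with a different seed or a slightly perturbed dataset.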