Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations in real-world settings have shortcomings in their design, limiting the conclusions that can be drawn about the methods' real-world utility. In this work, we seek to bridge this gap by conducting a study that evaluates three popular explainable ML methods in a setting consistent with the intended deployment context. We build on a previous study of e-commerce fraud detection and make crucial modifications to its setup, relaxing the simplifying assumptions made in the original work that departed from the deployment context. In doing so, we draw drastically different conclusions from the earlier work and find no evidence of incremental utility for the tested methods in this task. Our results highlight how seemingly trivial experimental design choices can yield misleading conclusions, and they underscore the necessity of not only evaluating explainable ML methods with tasks, data, users, and metrics grounded in the intended deployment contexts but also developing methods tailored to specific applications. In addition, we believe the design of this experiment can serve as a template for future study designs evaluating explainable ML methods in other real-world contexts.