There have been several research works proposing new Explainable AI (XAI) methods designed to generate model explanations having specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular post-hoc explanation methods -- LIME, SHAP, and TreeInterpreter -- on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just transaction data without access to the model score or explanations; Data + ML Model Score; and Data + ML Model Score + Explanations. Using strong statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Highlights of our conclusions include: i) Data Only results in the highest decision accuracy and the slowest decision time among all variants tested; ii) all explainers improve accuracy over the Data + ML Model Score variant but still result in lower accuracy than Data Only; iii) LIME was the least preferred by users, probably due to the substantially lower variability of its explanations from case to case.
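To make concrete what the three explainers compute, the sketch below generates per-feature attributions for a single transaction with LIME, SHAP, and TreeInterpreter. This is a minimal illustration, not the study's actual setup: the model, data, and feature names are synthetic placeholders, and it assumes the `lime`, `shap`, and `treeinterpreter` packages alongside a scikit-learn tree ensemble.

```python
# Minimal sketch: feature attributions from LIME, SHAP, and TreeInterpreter
# for one transaction scored by a tree-based fraud model. Data and feature
# names are illustrative placeholders, not the paper's real dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer
import shap
from treeinterpreter import treeinterpreter as ti

rng = np.random.default_rng(0)
feature_names = ["amount", "account_age_days", "num_prev_txns", "ip_risk_score"]
X_train = rng.normal(size=(1000, len(feature_names)))
y_train = (X_train[:, 0] + X_train[:, 3] + rng.normal(size=1000) > 1).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
x = X_train[:1]  # one transaction to explain

# LIME: fits a local surrogate model around the instance and reports
# the surrogate's feature weights as the explanation.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["legit", "fraud"], mode="classification",
)
lime_exp = lime_explainer.explain_instance(x[0], model.predict_proba, num_features=4)
print("LIME:", lime_exp.as_list())

# SHAP: Shapley-value attributions computed exactly for tree ensembles.
# The return shape (per-class list vs. array) varies across shap versions.
shap_explainer = shap.TreeExplainer(model)
print("SHAP:", shap_explainer.shap_values(x))

# TreeInterpreter: decomposes the prediction into a bias term plus
# additive per-feature contributions along the decision paths.
prediction, bias, contributions = ti.predict(model, x)
print("TreeInterpreter: prediction=%s bias=%s" % (prediction, bias))
print("contributions:", contributions)
```

All three methods output additive per-feature attributions for a single prediction, which is the form of explanation shown to the fraud analysts in the third experimental stage.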
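The abstract does not name the specific statistical tests used; as a hedged illustration of the kind of comparison involved, the sketch below contrasts decision accuracy (a two-proportion z-test) and decision time (a Mann-Whitney U test) between two information conditions, using fabricated per-case outcomes purely for demonstration.

```python
# Illustrative only: hypothetical per-case correctness (1 = correct) and
# decision times (seconds) for two conditions; not the study's data or tests.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
correct_data_only = rng.binomial(1, 0.80, size=200)
correct_with_expl = rng.binomial(1, 0.72, size=200)
time_data_only = rng.gamma(4.0, 8.0, size=200)
time_with_expl = rng.gamma(4.0, 6.0, size=200)

# Two-proportion z-test: does accuracy differ between conditions?
count = np.array([correct_data_only.sum(), correct_with_expl.sum()])
nobs = np.array([correct_data_only.size, correct_with_expl.size])
z, p_acc = proportions_ztest(count, nobs)
print(f"accuracy: z={z:.2f}, p={p_acc:.4f}")

# Mann-Whitney U test: do decision times differ, without assuming normality?
u, p_time = mannwhitneyu(time_data_only, time_with_expl)
print(f"decision time: U={u:.0f}, p={p_time:.4f}")
```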