Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark

In recent years, Explainable AI (xAI) has attracted a lot of attention as various countries turned explanations into a legal right. xAI allows for improving models beyond the accuracy metric by, e.g., debugging the learned patterns and demystifying the AI's behavior. The widespread use of xAI has brought new challenges. On the one hand, the number of published xAI algorithms has boomed, and it has become difficult for practitioners to select the right tool. On the other hand, some experiments have highlighted how easily data scientists can misuse xAI algorithms and misinterpret their results. To tackle the issue of comparing and correctly using feature importance xAI algorithms, we propose Compare-xAI, a benchmark that unifies all exclusive functional testing methods applied to xAI algorithms. We propose a selection protocol to shortlist non-redundant functional tests from the literature, i.e., tests that each target a specific end-user requirement in explaining a model. The benchmark encapsulates the complexity of evaluating xAI methods into a hierarchical scoring of three levels, targeting three end-user groups: researchers, practitioners, and laymen in xAI. The most detailed level provides one score per test. The second level regroups tests into five categories (fidelity, fragility, stability, simplicity, and stress tests). The last level is the aggregated comprehensibility score, which encapsulates the ease of correctly interpreting the algorithm's output in one easy-to-compare value. Compare-xAI's interactive user interface helps mitigate errors in interpreting xAI results by quickly listing the recommended xAI solutions for each ML task and their current limitations. The benchmark is made available at https://karim-53.github.io/cxai/
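
The three-level scoring described in the abstract can be illustrated with a minimal sketch. Only the five category names come from the abstract. The test names, the score values, and the use of an unweighted mean at each level are assumptions made for illustration; the abstract does not state how Compare-xAI actually aggregates its scores.

```python
from statistics import mean

# Level 1: one score per functional test, here assumed to lie in [0, 1].
# Category names follow the abstract; test names and values are hypothetical.
test_scores = {
    "fidelity": {"test_a": 1.0, "test_b": 0.5},
    "fragility": {"test_c": 0.0},
    "stability": {"test_d": 1.0},
    "simplicity": {"test_e": 0.5},
    "stress": {"test_f": 1.0},
}


def category_scores(scores):
    """Level 2: one score per category (assumed: unweighted mean of its tests)."""
    return {category: mean(tests.values()) for category, tests in scores.items()}


def comprehensibility_score(scores):
    """Level 3: a single value (assumed: unweighted mean of the category scores)."""
    return mean(category_scores(scores).values())


if __name__ == "__main__":
    print(category_scores(test_scores))
    print(comprehensibility_score(test_scores))
```

Under these assumptions, a researcher would read the per-test dictionary, a practitioner the five category scores, and a layman the single aggregated value.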
