The most popular methods in the AI-machine learning paradigm are mainly black boxes, which is why explaining AI decisions is an urgent matter. Although dedicated explanation tools have been developed extensively, the evaluation of their quality remains an open research question. In this paper, we generalize methodologies for evaluating post-hoc explainers of CNN decisions in visual classification tasks with reference-based and no-reference metrics. We apply them to our previously developed explainers (FEM, MLFEM) and to the popular Grad-CAM. The reference-based metrics are the Pearson correlation coefficient and Similarity, computed between the explanation map and its ground truth, represented by a Gaze Fixation Density Map obtained through a psycho-visual experiment. As a no-reference metric, we use the stability metric proposed by Alvarez-Melis and Jaakkola. We study its behaviour and its consensus with the reference-based metrics, and show that for several kinds of degradation of input images this metric agrees with the reference-based ones. Therefore, it can be used to evaluate the quality of explainers when ground truth is not available.
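To make the metrics concrete, the sketch below illustrates how the two reference-based metrics (Pearson correlation coefficient and Similarity between an explanation map and a Gaze Fixation Density Map) and a Monte-Carlo approximation of the Alvarez-Melis and Jaakkola stability metric could be computed with NumPy. This is an illustrative sketch, not the authors' exact implementation; the names expl_map, gfdm, and explainer are hypothetical placeholders.

```python
import numpy as np

def pearson_cc(expl_map, gfdm):
    """Pearson correlation coefficient between an explanation map and a
    gaze fixation density map (2-D arrays of equal shape)."""
    x = expl_map.ravel().astype(np.float64)
    y = gfdm.ravel().astype(np.float64)
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    return float(np.mean(x * y))

def similarity(expl_map, gfdm):
    """Similarity (SIM): histogram intersection of the two maps after
    normalising each to sum to 1, as used in saliency evaluation."""
    p = expl_map.ravel().astype(np.float64)
    q = gfdm.ravel().astype(np.float64)
    p /= p.sum() + 1e-12
    q /= q.sum() + 1e-12
    return float(np.minimum(p, q).sum())

def stability(explainer, image, eps=0.05, n_samples=20, rng=None):
    """No-reference stability estimate in the spirit of Alvarez-Melis and
    Jaakkola: the largest ratio of the change in the explanation to the
    change in the input over random perturbations within an eps-ball
    (a Monte-Carlo sketch of the local Lipschitz estimate, under assumed
    settings rather than the original protocol)."""
    rng = np.random.default_rng() if rng is None else rng
    base = explainer(image).ravel()
    worst = 0.0
    for _ in range(n_samples):
        noise = rng.uniform(-eps, eps, size=image.shape)
        perturbed = np.clip(image + noise, 0.0, 1.0)
        delta_expl = np.linalg.norm(explainer(perturbed).ravel() - base)
        delta_input = np.linalg.norm(noise.ravel()) + 1e-12
        worst = max(worst, delta_expl / delta_input)
    return worst
```

For both reference-based metrics a higher value indicates closer agreement with the gaze-based ground truth, while for the stability metric a lower value indicates a more robust explainer.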