Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision-based problems. However, these deep models are perceived as "black box" methods, given the limited understanding of their internal functioning. There has been significant recent interest in developing explainable deep learning models, and this paper is an effort in this direction. Building on a recently proposed method called Grad-CAM, we propose a generalized method called Grad-CAM++ that can provide better visual explanations of CNN model predictions than the state of the art, in terms of both better object localization and explaining occurrences of multiple object instances in a single image. We provide a mathematical derivation for the proposed method, which uses a weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. Our extensive experiments and evaluations, both subjective and objective, on standard datasets show that Grad-CAM++ provides promising human-interpretable visual explanations for a given CNN architecture across multiple tasks including classification, image caption generation, and 3D action recognition, as well as in new settings such as knowledge distillation.
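To make the weighting scheme concrete, the following is a minimal sketch of the map construction summarized above, writing $A^k$ for the $k$-th feature map of the last convolutional layer, $Y^c$ for the score of class $c$, and $\alpha^{kc}_{ij}$ for pixel-wise weighting coefficients whose closed form is derived later in the paper; the notation here is illustrative:

\[
w^c_k \;=\; \sum_{i}\sum_{j} \alpha^{kc}_{ij}\,\mathrm{ReLU}\!\left(\frac{\partial Y^c}{\partial A^k_{ij}}\right), \qquad
L^c_{ij} \;=\; \mathrm{ReLU}\!\left(\sum_{k} w^c_k\, A^k_{ij}\right).
\]

Intuitively, only positive gradients (locations where increasing the feature activation increases the class score) contribute to the weight $w^c_k$, and the resulting saliency map $L^c$ highlights the spatial evidence for class $c$.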