Neural network visualization techniques mark image locations by their relevance to the network's classification. Existing methods are effective in highlighting the regions that most affect the resulting classification. However, as we show, these methods are limited in their ability to identify the support for alternative classifications, an effect we term {\em the saliency bias} hypothesis. In this work, we integrate two lines of research, gradient-based methods and attribution-based methods, and develop an algorithm that provides per-class explainability. The algorithm back-projects the per-pixel local influence in a manner that is guided by the local attributions, while correcting for salient features that would otherwise bias the explanation. In an extensive battery of experiments, we demonstrate the ability of our method to produce class-specific visualizations, and not just explanations of the predicted label. Remarkably, the method obtains state-of-the-art results both in benchmarks that are commonly applied to gradient-based methods and in those that are employed mostly for evaluating attribution methods. Using a new unsupervised procedure, we also demonstrate that self-supervised methods learn semantic information.
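To make the high-level description concrete, the following is a minimal, hypothetical sketch in PyTorch of the general idea: a class-conditional influence map corrected by a class-agnostic saliency term. This is not the paper's algorithm; the function name \texttt{class\_specific\_map}, the gradient-times-input influence estimate, and the softmax-weighted correction are all illustrative assumptions standing in for the attribution-guided back-projection described above.
\begin{verbatim}
import torch
import torch.nn.functional as F

def class_specific_map(model, x, target_class):
    # x: input image batch of shape (1, C, H, W); model: a classifier.
    x = x.clone().requires_grad_(True)
    logits = model(x)  # (1, num_classes)

    # Per-pixel local influence for the target class (gradient * input).
    grad, = torch.autograd.grad(logits[0, target_class], x,
                                retain_graph=True)
    influence = (grad * x).sum(dim=1, keepdim=True)  # (1, 1, H, W)

    # Class-agnostic saliency: softmax-weighted influence over all classes.
    # Subtracting it is a crude stand-in for correcting salient features
    # that would otherwise bias the explanation toward the dominant class.
    # (For many classes, one would restrict this sum to the top-k classes.)
    probs = F.softmax(logits, dim=1).detach()
    agnostic = torch.zeros_like(influence)
    for c in range(logits.shape[1]):
        g, = torch.autograd.grad(logits[0, c], x, retain_graph=True)
        agnostic = agnostic + probs[0, c] * (g * x).sum(dim=1, keepdim=True)

    # Keep only positive class-specific evidence.
    return F.relu(influence - agnostic).detach()
\end{verbatim}
The point of the sketch is only the structure: the same input yields a different map for each \texttt{target\_class}, and the class-agnostic term suppresses regions that are salient regardless of the class queried.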