We propose a methodology that systematically applies deep explanation algorithms on a dataset-wide basis to compare different types of visual recognition backbones, such as convolutional networks (CNNs), global attention networks, and local attention networks. Examining both qualitative visualizations and quantitative statistics across the dataset helps us gain intuitions that are not merely anecdotal but are supported by statistics computed over the entire dataset. Specifically, we propose two methods. The first, sub-explanation counting, systematically searches for minimally sufficient explanations of all images and counts the number of sub-explanations for each network. The second, cross-testing, computes salient regions using one network and then evaluates the performance of other networks when shown only these regions as input. Through a combination of qualitative insights and quantitative statistics, we show that 1) there are significant differences between the salient features of CNNs and attention models, and 2) the occlusion robustness of local attention models and global attention models may stem from different decision-making mechanisms.
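To make the two procedures concrete, here is a minimal PyTorch-style sketch of both, assuming classifiers that map image tensors to logits. The helpers `saliency_fn` and `apply_patches`, along with the `keep_ratio` and `conf_thresh` values, are illustrative assumptions rather than the paper's exact protocol, and the exhaustive subset enumeration is for clarity only (the actual search over minimally sufficient explanations may be organized differently).

```python
# Sketch of sub-explanation counting and cross-testing. Assumptions:
# - `model` maps a (1, C, H, W) tensor to class logits;
# - `apply_patches(image, patches)` returns the image with only the given
#   patches visible (hypothetical helper);
# - `saliency_fn(model, image, label)` returns an (H, W) saliency map
#   (hypothetical helper; any attribution method could be plugged in).
from itertools import combinations
import torch

@torch.no_grad()
def count_sub_explanations(model, image, label, mse_patches, apply_patches,
                           conf_thresh=0.9):
    """Given a minimally sufficient explanation (a set of patches that keeps
    the prediction confident), count how many proper subsets of its patches
    also keep the model confident on the target class."""
    count = 0
    for r in range(1, len(mse_patches)):           # proper subsets only
        for subset in combinations(mse_patches, r):
            masked = apply_patches(image, subset)  # show only these patches
            prob = model(masked).softmax(dim=-1)[0, label]
            if prob.item() >= conf_thresh:
                count += 1
    return count

@torch.no_grad()
def cross_test(source_model, target_model, images, labels, saliency_fn,
               keep_ratio=0.1):
    """Compute salient regions with source_model, then evaluate target_model
    when shown only those regions of each image."""
    correct = 0
    for img, label in zip(images, labels):
        sal = saliency_fn(source_model, img, label)   # (H, W) saliency map
        k = max(1, int(keep_ratio * sal.numel()))
        thresh = sal.flatten().topk(k).values.min()   # keep top-k salient pixels
        mask = (sal >= thresh).float()                # (H, W) binary mask
        masked = img * mask                           # occlude everything else
        pred = target_model(masked.unsqueeze(0)).argmax(dim=-1)
        correct += int(pred.item() == label)
    return correct / len(labels)
```

Under this reading, a higher sub-explanation count indicates more compositional, redundant evidence for a decision, while cross-testing accuracy measures how transferable one network's salient regions are to another network's decision process.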