The prevalence of attention mechanisms has raised concerns about the interpretability of attention distributions. Although attention provides insight into how a model operates, using it to explain model predictions remains highly dubious. The community is still seeking more interpretable strategies for identifying the local active regions that contribute most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures task-relevant, human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps. Representative features are then grouped from these maps based on local pairwise feature similarity, producing finer-grained, more precise attention maps that highlight task-relevant parts of the input. The obtained attention maps are ranked according to the `active level' of the compound feature, which indicates the importance level of the highlighted regions. The proposed model can be easily adapted to a wide variety of modern deep models that involve classification. It is also more accurate, faster, and has a smaller memory footprint than usual neural attention modules. Extensive experiments showcase more comprehensive visual explanations compared to the state-of-the-art visualization model across multiple tasks, including few-shot classification, person re-identification, and fine-grained image classification. The proposed visualization model sheds light on how neural networks `pay their attention' differently in different tasks.
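The grouping and ranking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the `active level' of a feature is taken to be its L2 norm, pairwise similarity is taken to be cosine similarity, and positions already covered by an earlier map are down-weighted before the next representative is picked. The function name `representative_attention` and its parameters are hypothetical, and the distillation step that raises feature-map resolution is not covered.

```python
import math


def norm(v):
    """L2 norm of a feature vector (assumed proxy for its 'active level')."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def representative_attention(features, num_maps=2):
    """Build ranked, parameter-free attention maps from a flattened feature map.

    features: list of feature vectors, one per spatial position.
    Returns a list of (representative_index, attention_map) pairs,
    ordered from the most active representative to the least.
    """
    # Remaining activity per position; starts as the assumed active level.
    remaining = [norm(f) for f in features]
    maps = []
    for _ in range(num_maps):
        # Pick the most active position not yet explained by earlier maps.
        rep = max(range(len(features)), key=lambda i: remaining[i])
        if remaining[rep] <= 0.0:
            break  # every position is already covered
        # Attention map: similarity of each position to the representative,
        # with negative similarities clamped to zero.
        attn = [max(0.0, cosine(features[rep], f)) for f in features]
        maps.append((rep, attn))
        # Down-weight positions that this map already highlights.
        remaining = [r * (1.0 - a) for r, a in zip(remaining, attn)]
    return maps


# Toy feature map: four positions, two channels.
feats = [[3.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 0.5]]
for rank, (idx, attn) in enumerate(representative_attention(feats), start=1):
    print(rank, idx, attn)
```

In this toy input the first map groups the two positions aligned with the first channel and the second map groups the other two. The maps come out ordered by the activity of their representative, which mirrors the ranking by `active level' described in the abstract.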