Salient Positions based Attention Network for Image Classification

The self-attention mechanism has attracted wide attention for its chief advantage, the ability to model long-range dependencies. Among its variants in computer vision, the non-local block models the global dependencies of the input feature maps. Gathering global contextual information inevitably requires a large amount of memory and computation, a cost that has been studied extensively in the past several years. A further question about the self-attention scheme remains, however: is all the information gathered from the global scope helpful for contextual modelling? To our knowledge, few studies have addressed this question. To tackle both problems, this paper proposes SPANet, a salient-positions-based attention scheme inspired by some interesting observations on the attention maps and affinity matrices generated in the self-attention scheme. We believe these observations contribute to a better understanding of self-attention. SPANet uses a salient positions selection algorithm to select only a limited number of salient points to attend to when computing the attention map. This approach not only saves a large amount of memory and computation but also attempts to distill the positive information from the transformation of the input feature maps. In the implementation, because feature maps have a high channel dimension and thus differ fundamentally from ordinary visual images, we take the squared power of the feature maps, computed along the channel dimension, as the saliency metric of each position. Overall, unlike the non-local block, SPANet models contextual information using only the selected positions rather than all positions, and along the channel dimension rather than the spatial dimension. Our source code is available at https://github.com/likyoo/SPANet.
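The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository for that), and it rests on several assumptions: "squared power along the channel dimension" is read as the per-position sum of squares over channels, the salient points are taken to be the top-k positions under that score, and the channel-wise attention is approximated by a softmax over a C x C affinity matrix built from the selected positions only. The function names `select_salient_positions` and `channel_attention_on_salient`, the parameter `k`, and the residual connection are all hypothetical.

```python
import numpy as np


def select_salient_positions(feat, k):
    """Return flat indices of the k most salient positions, most salient first.

    feat: array of shape (C, H, W).
    Assumption: saliency of a position = sum of squared activations over channels.
    """
    c, h, w = feat.shape
    if not 1 <= k <= h * w:
        raise ValueError("k must be between 1 and H*W")
    flat = feat.reshape(c, h * w)
    saliency = np.sum(flat ** 2, axis=0)           # shape (H*W,)
    top = np.argpartition(saliency, -k)[-k:]       # k largest, unordered
    return top[np.argsort(saliency[top])[::-1]]    # order by descending saliency


def channel_attention_on_salient(feat, k):
    """Channel-wise attention whose affinity uses only k salient positions.

    Assumption: this is one plausible reading of the abstract, not SPANet itself.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)                  # (C, N) with N = H*W
    idx = select_salient_positions(feat, k)
    selected = flat[:, idx]                        # (C, k): selected positions only
    affinity = selected @ selected.T               # (C, C), costs O(C^2 * k)
    affinity = affinity - affinity.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=1, keepdims=True)
    context = weights @ flat                       # re-weight channels, (C, N)
    return feat + context.reshape(c, h, w)         # residual connection (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16, 16))
    y = channel_attention_on_salient(x, k=32)
    print(y.shape)
```

Under these assumptions the affinity matrix costs O(C^2 k) to build instead of the O(C^2 N) needed when every one of the N = H x W positions contributes, which is where the memory and compute saving in this sketch comes from.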
