Self-attention networks have shown remarkable progress in computer vision tasks such as image classification. The main benefit of the self-attention mechanism is its ability to capture long-range feature interactions in attention maps. However, computing these attention maps requires learnable keys, queries, and positional encodings, whose use is often unintuitive and computationally expensive. To mitigate this problem, we propose a novel self-attention module whose attention maps are explicitly modeled with only a single learnable parameter, keeping computational overhead low. This explicit modeling relies on a geometric prior and is based on the observation that the spatial context of a given pixel in an image is mostly dominated by its neighbors, while more distant pixels contribute only marginally. Concretely, the attention maps are parametrized via simple functions (e.g., a Gaussian kernel) with a learnable radius and are modeled independently of the input content. Our evaluation shows that our method improves accuracy by up to 2.2% over ResNet baselines on ImageNet ILSVRC and outperforms other self-attention methods such as AA-ResNet152 by 0.9% in accuracy with 6.4% fewer parameters and 6.7% fewer GFLOPs. This result empirically indicates the value of incorporating a geometric prior into the self-attention mechanism for image classification.
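As a concrete illustration, the following is a minimal PyTorch sketch of such a content-independent attention map: a softmax-normalized Gaussian over pairwise pixel distances whose width is governed by a single learnable (log-)radius. The module and parameter names (GaussianAttention2d, init_radius) and the omission of any value projection are our own assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class GaussianAttention2d(nn.Module):
    """Content-independent spatial attention with a Gaussian geometric prior.

    The attention map depends only on pixel-to-pixel distances and a single
    learnable radius, not on the input features (no keys or queries).
    """

    def __init__(self, init_radius: float = 2.0):
        super().__init__()
        # The single learnable parameter: the (log of the) Gaussian radius.
        self.log_radius = nn.Parameter(torch.tensor(float(init_radius)).log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; a value projection is omitted for brevity.
        b, c, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=x.device, dtype=x.dtype),
            torch.arange(w, device=x.device, dtype=x.dtype),
            indexing="ij",
        )
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (HW, 2)
        dist2 = torch.cdist(coords, coords).pow(2)                  # squared pixel distances, (HW, HW)
        sigma = self.log_radius.exp()
        # Rows are softmax-normalized Gaussians: neighbors dominate, distant pixels decay.
        attn = torch.softmax(-dist2 / (2.0 * sigma ** 2), dim=-1)   # (HW, HW)
        v = x.flatten(2)                                            # (B, C, HW)
        out = torch.einsum("bcn,mn->bcm", v, attn)                  # aggregate values per output pixel
        return out.reshape(b, c, h, w)


# Usage: y = GaussianAttention2d(init_radius=3.0)(torch.randn(2, 64, 14, 14))
```

Plugging such a module in place of a spatial self-attention layer shows how a single radius parameter can stand in for the learned key/query projections and positional encodings described above.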