Attention modules for Convolutional Neural Networks (CNNs) are an effective method to enhance performance on multiple computer-vision tasks. While existing methods appropriately model channel-, spatial-, and self-attention, they operate primarily in a feedforward, bottom-up manner. Consequently, the attention mechanism depends strongly on the local information of a single input feature map and does not incorporate the semantically richer contextual information available at higher layers, which can specify "what and where to look" in lower-level feature maps through top-down information flow. Accordingly, in this work we propose a lightweight top-down attention module (TDAM) that iteratively generates a "visual searchlight" to perform channel and spatial modulation of its inputs, outputting more contextually relevant feature maps at each computation step. Our experiments indicate that TDAM enhances the performance of CNNs across multiple object-recognition benchmarks and outperforms prominent attention modules while being more parameter- and memory-efficient. Further, TDAM-based models learn to "shift attention" by localizing individual objects or features at each computation step without any explicit supervision, resulting in a 5% improvement for ResNet50 on weakly-supervised object localization. Source code and models are publicly available at: https://github.com/shantanuj/TDAM_Top_down_attention_module .
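To make the mechanism concrete, below is a minimal PyTorch sketch of iterative top-down channel-and-spatial modulation in the spirit of the description above. This is an illustrative interpretation, not the authors' TDAM implementation (see the linked repository for that); the class name `TopDownAttentionSketch` and the `reduction` and `n_iters` parameters are hypothetical.

```python
# Illustrative sketch only: a simplified take on iterative top-down
# attention, loosely following the abstract. Not the official TDAM code;
# all names and hyperparameters here are assumptions for demonstration.
import torch
import torch.nn as nn


class TopDownAttentionSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, n_iters: int = 2):
        super().__init__()
        self.n_iters = n_iters
        # Bottleneck MLP: pooled context -> per-channel "searchlight" weights.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # 1x1 conv: modulated map -> spatial "searchlight" ("where to look").
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        for _ in range(self.n_iters):
            # Channel modulation: context pooled from the current output
            # re-weights the channels of the original input feature map.
            ctx = out.mean(dim=(2, 3))                      # (B, C)
            w_c = self.channel_gate(ctx)[:, :, None, None]  # (B, C, 1, 1)
            out = x * w_c
            # Spatial modulation of the channel-modulated map.
            w_s = self.spatial_gate(out)                    # (B, 1, H, W)
            out = out * w_s
        return out


if __name__ == "__main__":
    feat = torch.randn(2, 64, 14, 14)
    attn = TopDownAttentionSketch(64)
    print(attn(feat).shape)  # torch.Size([2, 64, 14, 14])
```

In this sketch, each iteration re-reads the original input `x` under a searchlight derived from the previous output, which is the loose sense in which the modulation is "top-down" and iterative; the feature map returned at each computation step is progressively more contextually filtered.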