Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
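To make the division of labor concrete, the sketch below shows one way the top-down weighting over bottom-up region features could be computed: a small attention module scores each Faster R-CNN region feature against a task-specific query (e.g. a caption decoder state or an encoded question) and returns the attention-weighted image feature. This is a minimal illustration, not the paper's exact implementation; the class name, dimensions, and the `region_feats`/`query` variables are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft top-down attention over a variable set of bottom-up region features.

    Hypothetical sketch: `region_feats` stands in for the per-region feature
    vectors produced by the bottom-up detector, and `query` stands in for the
    top-down context (caption LSTM state or question encoding).
    """
    def __init__(self, region_dim: int, query_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, region_dim); query: (batch, query_dim)
        joint = torch.tanh(self.proj_region(region_feats)
                           + self.proj_query(query).unsqueeze(1))
        # One scalar score per region, normalized into attention weights.
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, num_regions)
        # Attended image feature: weighted sum of the region feature vectors.
        return (weights.unsqueeze(-1) * region_feats).sum(dim=1)

# Illustrative usage: 36 regions with 2048-d features, a 1024-d query vector.
attn = TopDownAttention(region_dim=2048, query_dim=1024)
pooled = attn(torch.randn(4, 36, 2048), torch.randn(4, 1024))
print(pooled.shape)  # torch.Size([4, 2048])
```

The key design point conveyed by the abstract is that attention weights are computed over a set of object-level region features rather than over a uniform CNN grid; the scoring function itself is a standard additive-attention form.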