The current success of modern visual reasoning systems can arguably be attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is guided only by a very sparse signal (i.e. the answer label) at the end of the inference chain, which causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. We learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. The grounding guides the attention mechanism inside VQA models through two complementary mechanisms: pre-training the attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, strengthens their robustness when access to supervised data is limited, and increases interpretability.
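To make the idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of how cross-modality attention weights could be guided by an external grounding prior at inference time. The function `grounded_attention`, the linear blend, and the mixing coefficient `alpha` are assumptions introduced only for illustration; PyTorch is assumed.

```python
import torch
import torch.nn.functional as F

def grounded_attention(q, k, v, grounding, alpha=0.5):
    """Cross-modality attention whose weights are blended with an
    external linguistic-visual grounding prior (illustrative sketch).

    q:         (n_words, d)       query-side word features
    k, v:      (n_objs, d)        visual object keys and values
    grounding: (n_words, n_objs)  word-to-object grounding scores in [0, 1]
    alpha:     mixing weight between learned attention and the prior
    """
    d = q.size(-1)
    logits = q @ k.t() / d ** 0.5                 # learned cross-modality scores
    attn = F.softmax(logits, dim=-1)              # unconstrained attention weights
    prior = grounding / grounding.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    attn = (1 - alpha) * attn + alpha * prior     # pull weights toward the grounding
    return attn @ v, attn

# Toy usage with random features and hypothetical grounding scores.
q = torch.randn(4, 64)   # 4 question words
k = torch.randn(7, 64)   # 7 detected objects
v = torch.randn(7, 64)
g = torch.rand(4, 7)     # hypothetical word-to-object grounding matrix
out, weights = grounded_attention(q, k, v, g)
```

The same grounding signal could also serve as a pre-training target for the attention weights; the blend above only illustrates the inference-time, case-by-case guidance described in the abstract.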