A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
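To make the idea of jointly attending over image regions and question words more concrete, here is a minimal NumPy sketch of one affinity-based form of co-attention: an affinity matrix between word and region features drives attention weights over both modalities simultaneously. This is an illustrative assumption about how such a mechanism can be realized, not the paper's exact architecture; the parameter names (W_b, W_v, W_q, w_hv, w_hq) and the hidden size k are hypothetical and randomly initialized here rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q, k=64, seed=0):
    """Affinity-based co-attention over image regions and question words.

    V : (d, N) image-region features (e.g. N spatial locations from a CNN)
    Q : (d, T) question-word features (e.g. T word embeddings)
    Returns attention weights over regions and words, plus the attended
    feature vector for each modality.
    """
    d, N = V.shape
    _, T = Q.shape
    rng = np.random.default_rng(seed)
    # Illustrative parameters; in a real model these would be learned.
    W_b = rng.standard_normal((d, d)) * 0.01
    W_v = rng.standard_normal((k, d)) * 0.01
    W_q = rng.standard_normal((k, d)) * 0.01
    w_hv = rng.standard_normal(k) * 0.01
    w_hq = rng.standard_normal(k) * 0.01

    # Affinity between every question word and every image region: (T, N).
    C = np.tanh(Q.T @ W_b @ V)
    # Each modality attends to the other through the affinity matrix.
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)    # (k, N)
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)  # (k, T)
    a_v = softmax(w_hv @ H_v)                 # attention over N image regions
    a_q = softmax(w_hq @ H_q)                 # attention over T question words
    v_att = V @ a_v                           # attended image feature, (d,)
    q_att = Q @ a_q                           # attended question feature, (d,)
    return a_v, a_q, v_att, q_att

# Toy usage: 49 image regions and 8 question words, 512-d features.
V = np.random.randn(512, 49)
Q = np.random.randn(512, 8)
a_v, a_q, v_att, q_att = co_attention(V, Q)
print(a_v.shape, a_q.shape, v_att.shape, q_att.shape)  # (49,) (8,) (512,) (512,)
```

The attended image and question vectors produced this way could then be combined to predict an answer; applying the same mechanism at word, phrase, and sentence levels of the question would give the hierarchical behavior the abstract describes.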