Answering semantically complicated questions about an image is challenging in the Visual Question Answering (VQA) task. Although the image can be well represented by deep learning, the question is often embedded in a simple way that fails to capture its meaning. Moreover, because visual and textual features come from different modalities, there is a gap between them, making it difficult to align and exploit cross-modality information. In this paper, we address these two problems and propose a Graph Matching Attention (GMA) network. First, it builds a graph not only for the image but also for the question, using both syntactic and embedding information. Next, we explore intra-modality relationships with a dual-stage graph encoder and then apply a bilateral cross-modality graph matching attention to infer the relationships between the image and the question. The updated cross-modality features are fed into the answer prediction module to produce the final answer. Experiments demonstrate that our network achieves state-of-the-art performance on both the GQA and VQA 2.0 datasets. Ablation studies verify the effectiveness of each module in our GMA network.
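To make the bilateral matching step concrete, the following PyTorch sketch shows one plausible form of cross-modality graph matching attention: node features of the visual graph and the question graph attend to each other through a shared affinity matrix, and each modality's nodes are updated with information from the other. The class name, the linear projections, and the residual updates are our own assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMatchingAttention(nn.Module):
    """Illustrative sketch of bilateral cross-modality graph matching attention.

    Given visual graph node features and question graph node features, it
    computes a cross-modality affinity matrix and lets each modality attend
    to the other, producing updated node features for both graphs.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projects visual nodes (assumed)
        self.proj_q = nn.Linear(dim, dim)  # projects question nodes (assumed)
        self.scale = dim ** -0.5           # scaled dot-product, as in standard attention

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        # v: (n_v, dim) visual graph nodes; q: (n_q, dim) question graph nodes
        affinity = self.proj_v(v) @ self.proj_q(q).t() * self.scale  # (n_v, n_q)
        attn_v2q = F.softmax(affinity, dim=1)      # each visual node attends over question nodes
        attn_q2v = F.softmax(affinity.t(), dim=1)  # each question node attends over visual nodes
        v_updated = v + attn_v2q @ q               # visual nodes enriched with question context
        q_updated = q + attn_q2v @ v               # question nodes enriched with visual context
        return v_updated, q_updated

# Usage with hypothetical sizes: 36 visual regions, 12 question tokens, 512-d features.
v = torch.randn(36, 512)
q = torch.randn(12, 512)
gma = GraphMatchingAttention(512)
v_new, q_new = gma(v, q)
```

The softmax directions make the matching bilateral: row-wise normalization aligns each visual node with the question graph, while the transposed, row-wise normalization aligns each question node with the visual graph.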