Visual Question Answering is a challenging problem that requires combining concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two-stream strategy, computing image and question features that are subsequently merged using a variety of techniques. Nonetheless, very few rely on higher-level image representations, which can capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain state-of-the-art results with 65.77% accuracy and demonstrate the interpretability of the proposed method.
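
The two components named in the abstract, a graph learner that builds a question-specific graph over image regions and a graph convolution that propagates features along it, can be illustrated with a minimal NumPy sketch. The abstract does not specify the architecture, so everything below is an assumption made for illustration: the function names `learn_question_graph` and `graph_conv`, the concatenation of question and object features, the dot-product adjacency, the top-k edge selection, and the row-normalised convolution (a simple stand-in, not necessarily the convolution used in the paper). Weights are random placeholders rather than trained parameters.

```python
import numpy as np


def learn_question_graph(obj_feats, q_emb, w_joint, top_k):
    """Hypothetical graph learner: builds a sparse adjacency matrix
    over image objects that depends on the question embedding."""
    n = obj_feats.shape[0]
    # Condition each object feature on the question by concatenation (assumption).
    joint = np.concatenate([obj_feats, np.tile(q_emb, (n, 1))], axis=1)
    # Project to a joint embedding with a ReLU non-linearity.
    emb = np.maximum(joint @ w_joint, 0.0)
    # Pairwise similarity between joint embeddings gives a dense adjacency.
    adj = emb @ emb.T
    # Keep only the top_k strongest outgoing edges for every node.
    idx = np.argsort(-adj, axis=1)[:, :top_k]
    mask = np.zeros_like(adj)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return adj * mask


def graph_conv(adj, feats, w_out):
    """Simple stand-in graph convolution: each node averages the features
    of its selected neighbours, then applies a linear map and a ReLU."""
    row_sum = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(row_sum, 1e-8)  # guard against empty rows
    return np.maximum(norm_adj @ feats @ w_out, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj_feats = rng.normal(size=(5, 8))   # 5 detected objects, 8-dim features
    q_emb = rng.normal(size=(4,))         # 4-dim question embedding
    w_joint = rng.normal(size=(12, 6))    # (8 + 4) -> 6 joint embedding
    w_out = rng.normal(size=(8, 3))       # 8 -> 3 output features

    adj = learn_question_graph(obj_feats, q_emb, w_joint, top_k=2)
    out = graph_conv(adj, obj_feats, w_out)
    print(adj.shape, out.shape)
```

Because the adjacency is recomputed from the question embedding, a different question over the same image yields a different graph. Inspecting which edges survive the top-k selection is one plausible way to examine the interpretability the abstract mentions.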