As in many tasks that combine vision and language, both modalities play a crucial role in Visual Question Answering (VQA). To solve the task properly, a model must understand both the content of the input image and the nature of the question. While the fusion between modalities, clearly another essential part of the problem, has been extensively studied, the vision side has received less attention in recent work. Current state-of-the-art methods for VQA mainly rely on off-the-shelf object detectors that deliver a set of object bounding boxes and embeddings, which are then combined with question word embeddings through a reasoning module. In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images. We also study the impact of two methods of incorporating the information about objects required to answer a question: directly in the reasoning module, and earlier, in the object selection stage. This work highlights the importance of vision in the context of VQA, and the value of tailoring the vision methods used in VQA to the task at hand.
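To make the detector-based pipeline mentioned above concrete, the sketch below shows one plausible PyTorch implementation: frozen detector features for a set of objects are fused with question word embeddings by an attention-based reasoning module that scores candidate answers. The dimensions, the GRU question encoder, and the single attention step are illustrative assumptions, not the specific architecture studied in the paper.

```python
import torch
import torch.nn as nn

class VQAPipeline(nn.Module):
    """Minimal sketch of a standard detector-based VQA pipeline (assumed, illustrative)."""

    def __init__(self, obj_dim=2048, word_dim=300, hidden_dim=512, num_answers=3000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)           # project detector features
        self.question_encoder = nn.GRU(word_dim, hidden_dim, batch_first=True)
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)     # scores over candidate answers

    def forward(self, obj_feats, word_embs):
        # obj_feats: (batch, num_objects, obj_dim) from an off-the-shelf object detector
        # word_embs: (batch, num_words, word_dim) question word embeddings
        v = self.obj_proj(obj_feats)
        _, q = self.question_encoder(word_embs)                  # (1, batch, hidden_dim)
        q = q.transpose(0, 1)                                    # (batch, 1, hidden_dim)
        fused, _ = self.attention(query=q, key=v, value=v)       # question attends to objects
        return self.classifier(fused.squeeze(1))                 # (batch, num_answers)


# Toy usage: 2 images with 36 detected objects each, 10-word questions
model = VQAPipeline()
scores = model(torch.randn(2, 36, 2048), torch.randn(2, 10, 300))
print(scores.shape)  # torch.Size([2, 3000])
```

Varying the number and quality of the detected objects fed into such a module, as the paper does, changes only the `obj_feats` input, which is what makes the vision stage a natural bottleneck to study in isolation.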