Visual question answering (VQA) demands the simultaneous comprehension of both the visual content of an image and a natural-language question. In some cases, the reasoning also requires common sense or general knowledge, which usually appears in the form of text. Current methods jointly embed the visual information and the textual features into a common space. However, modeling the complex interactions between these two different modalities is not an easy task. Instead of struggling with multimodal feature fusion, in this paper we propose to unify all of the input information in natural language so as to convert VQA into a machine reading comprehension problem. With this transformation, our method can not only tackle VQA datasets that focus on observation-based questions, but can also be naturally extended to handle knowledge-based VQA, which requires exploring a large-scale external knowledge base. It is a step towards exploiting large volumes of text and natural language processing techniques to address the VQA problem. Two types of models are proposed to deal with open-ended VQA and multiple-choice VQA respectively. We evaluate our models on three VQA benchmarks; performance comparable to the state of the art demonstrates the effectiveness of the proposed method.
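The core transformation can be illustrated with a minimal sketch, assuming hypothetical placeholder components: `image_to_sentences`, `retrieve_knowledge`, and `answer_with_mrc` stand in for an image-description generator, a knowledge-base retriever, and an off-the-shelf machine reading comprehension reader; they are not the paper's actual models.

```python
from typing import List

def image_to_sentences(image_path: str) -> List[str]:
    """Placeholder: a captioning / dense-captioning model that describes the image in text."""
    return ["A man is riding a brown horse on the beach.",
            "The sky is cloudy and the sand is wet."]

def retrieve_knowledge(question: str) -> List[str]:
    """Placeholder: retrieve relevant facts from an external knowledge base as sentences."""
    return ["A horse is a large domesticated mammal often used for riding."]

def answer_with_mrc(context: str, question: str) -> str:
    """Placeholder: any extractive MRC reader (e.g. a SQuAD-style model) could be plugged in."""
    return "a horse"

def vqa_as_reading_comprehension(image_path: str, question: str) -> str:
    # 1. Convert the visual content into natural-language sentences.
    context_sentences = image_to_sentences(image_path)
    # 2. Optionally append external knowledge for knowledge-based VQA.
    context_sentences += retrieve_knowledge(question)
    # 3. Treat the concatenated text as a reading passage and answer with an MRC model.
    context = " ".join(context_sentences)
    return answer_with_mrc(context, question)

print(vqa_as_reading_comprehension("demo.jpg", "What animal is the man riding?"))
```

Because every input is text after step 1, the same pipeline covers both observation-based and knowledge-based questions; only the passage construction changes.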