Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it demands a deep semantic and linguistic understanding of the question and the ability to associate it with the various objects present in the image, it is an ambitious task that requires multi-modal reasoning across computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method performs context-driven, sequential reasoning over the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths, which form the basis for deriving answers. We conduct an experimental study on the challenging GQA dataset, based on both manually curated and automatically generated scene graphs. Our results show that we keep up with human performance on manually curated scene graphs. Moreover, we find that Graphhopper outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs by a significant margin.
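To make the multi-hop navigation idea concrete, the following is a minimal sketch, not the authors' implementation: a toy scene graph is stored as an adjacency map, and a path sampler stands in for the trained policy. All names and the graph contents are illustrative assumptions, and a real agent would score outgoing edges conditioned on the question rather than choosing uniformly at random.

```python
import random

# Hypothetical toy scene graph: nodes are scene entities, edges carry
# semantic/spatial relations. Structure and labels are illustrative only,
# not the GQA format or the paper's actual representation.
scene_graph = {
    "woman": [("holding", "umbrella"), ("wearing", "coat")],
    "umbrella": [("color", "red"), ("above", "woman")],
    "coat": [("color", "blue")],
}

def sample_path(graph, start, num_hops=2):
    """Sample a multi-hop reasoning path starting from `start`.

    A trained reinforcement learning policy would rank the outgoing
    edges conditioned on the question; uniform random choice is used
    here purely as a placeholder.
    """
    path, node = [start], start
    for _ in range(num_hops):
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: no outgoing relations to follow
        relation, node = random.choice(edges)
        path.extend([relation, node])
    return path

# A sampled path such as ['woman', 'holding', 'umbrella', 'color', 'red']
# could ground an answer to "What color is the umbrella the woman holds?".
print(sample_path(scene_graph, "woman"))
```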