Visual Question Answering (VQA) is of tremendous interest to the research community, with important applications such as aiding visually impaired users and image-based search. In this work, we explore the use of scene graphs for solving the VQA task. We conduct experiments on the GQA dataset, which presents a challenging set of questions requiring counting, compositionality, and advanced reasoning capability, and provides scene graphs for a large number of images. We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, propose a training curriculum to leverage human-annotated and auto-generated scene graphs, and build late fusion architectures to learn from multiple image representations. We present a multi-faceted study into the use of scene graphs for VQA, making this work the first of its kind.