Visual understanding requires seamless integration between recognition and reasoning: beyond image-level recognition (e.g., detecting objects), systems must perform concept-level reasoning (e.g., inferring the context of objects and the intents of people). However, existing methods only model image-level features and do not ground them in, or reason over, background concepts such as knowledge graphs (KGs). In this work, we propose a novel visual question answering method, VQA-GNN, which unifies image-level information and conceptual knowledge to perform joint reasoning over the scene. Specifically, given a question-image pair, we build a scene graph from the image, retrieve a relevant linguistic subgraph from ConceptNet and a visual subgraph from Visual Genome, and unify these three graphs and the question into one joint graph, the multimodal semantic graph. VQA-GNN then learns to aggregate messages and reason across the different modalities captured by this multimodal semantic graph. In evaluation on the VCR task, our method outperforms previous scene-graph-based Trans-VL models by over 4%, and VQA-GNN-Large, our model that fuses a Trans-VL, further improves the state of the art by 2%, attaining the top of the VCR leaderboard at the time of submission. This result suggests the efficacy of our model in performing conceptual reasoning beyond image-level recognition for visual understanding. Finally, we demonstrate that our model is the first to provide interpretability across both visual and textual knowledge domains for the VQA task.
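To make the pipeline concrete, the sketch below illustrates, under simplifying assumptions, the two ideas the abstract describes: stacking nodes from the scene graph, the visual KG subgraph, the concept KG subgraph, and the question into one joint graph, and then running a round of GNN message passing over it. This is not the authors' implementation; `build_joint_graph`, `SimpleMessagePassing`, the mean-aggregation scheme, and the fully connected toy adjacency are all hypothetical placeholders.

```python
# Hypothetical sketch of a multimodal semantic graph plus one message-passing
# step, NOT the VQA-GNN implementation. All names and shapes are illustrative.
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One round of mean-aggregation message passing over a joint graph."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, adj):
        # x: [num_nodes, dim] node features; adj: [num_nodes, num_nodes] 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msgs = adj @ x / deg  # mean of neighbor features
        return self.update(torch.cat([x, msgs], dim=-1))

def build_joint_graph(scene_nodes, visual_kg_nodes, concept_nodes, question_node):
    """Stack node features from the scene graph, the Visual Genome subgraph,
    the ConceptNet subgraph, and the question node into one feature matrix.
    Cross-modal edges (e.g., question-to-node links) would be built separately."""
    return torch.cat([scene_nodes, visual_kg_nodes, concept_nodes, question_node], dim=0)

# Toy usage with random node features and a fully connected adjacency matrix.
dim = 64
x = build_joint_graph(torch.randn(5, dim), torch.randn(4, dim),
                      torch.randn(6, dim), torch.randn(1, dim))
adj = torch.ones(x.size(0), x.size(0))
layer = SimpleMessagePassing(dim)
out = layer(x, adj)      # updated node representations
print(out.shape)         # torch.Size([16, 64])
```

In the actual method, repeated rounds of such message passing let question, scene, and concept nodes exchange information, which is what enables reasoning across modalities rather than over image features alone.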