Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with the corresponding rationale, which requires real-world inference ability. The VCR task, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge, is a cognition-level scene understanding task. The VCR task has attracted researchers' interest due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task generally rely on pre-training or on memory-augmented models that encode long-range dependencies. However, these approaches suffer from limited generalizability and from information loss over long sequences. In this paper, we propose PAVCR, a parallel attention-based cognitive VCR network that fuses visual-textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
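To make the parallel-fusion idea concrete, the following is a minimal PyTorch sketch of two attention streams that run in parallel over visual and textual features before being fused into a joint representation. The module and attribute names (ParallelAttentionFusion, text_to_image, image_to_text, fuse) are hypothetical, and the sketch is an illustration of the general technique, not the paper's actual PAVCR implementation.

import torch
import torch.nn as nn

class ParallelAttentionFusion(nn.Module):
    """Illustrative sketch: two cross-attention streams run in parallel
    over visual and textual features, then their outputs are fused.
    Hypothetical design, not the paper's implementation."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Text attends to image regions, and image regions attend to text,
        # in parallel rather than stacked sequentially.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, num_tokens, dim)  - question/answer embeddings
        # image_feats: (batch, num_regions, dim) - object-region embeddings
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool each stream, concatenate, and project to a joint representation
        # that a downstream classifier can score against candidate answers.
        pooled = torch.cat([t2i.mean(dim=1), i2t.mean(dim=1)], dim=-1)
        return self.fuse(pooled)  # (batch, dim)

Because the two streams are independent, they can be computed concurrently, which is the efficiency argument behind parallel (rather than sequential) fusion of the two modalities.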