Visual Commonsense Reasoning (VCR) predicts an answer together with a supporting rationale, given a question-image pair. VCR is a recently introduced visual scene understanding task with a wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task generally rely on pre-training or on exploiting memory with models that encode long-range dependencies. However, these approaches suffer from limited generalizability and a lack of prior knowledge. In this paper we propose a dynamic working memory based cognitive VCR network, which stores commonsense knowledge accumulated across sentences to provide prior knowledge for inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides intuitive interpretation of visual commonsense reasoning. A Python implementation of our mechanism is publicly available at https://github.com/tanjatang/DMVCR.
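To make the core idea concrete, below is a minimal PyTorch sketch of a working-memory module that accumulates commonsense representations in learnable memory slots and retrieves them via attention to enrich a sentence encoding at inference time. This is an illustrative sketch under assumed design choices, not the authors' released implementation; the class name `DynamicWorkingMemory` and the parameters `num_slots` and `hidden_dim` are hypothetical.

```python
# Illustrative sketch only: a memory bank queried by attention, with a gate
# that fuses retrieved prior knowledge into the current sentence encoding.
# Names and hyperparameters are assumptions, not from the DMVCR repository.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicWorkingMemory(nn.Module):
    def __init__(self, num_slots: int = 64, hidden_dim: int = 512):
        super().__init__()
        # Learnable memory slots acting as stored commonsense priors.
        self.memory = nn.Parameter(torch.randn(num_slots, hidden_dim) * 0.02)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, sentence_repr: torch.Tensor) -> torch.Tensor:
        # sentence_repr: (batch, hidden_dim) encoding of the current sentence.
        q = self.query_proj(sentence_repr)             # (B, D)
        attn = F.softmax(q @ self.memory.t(), dim=-1)  # (B, num_slots)
        retrieved = attn @ self.memory                 # (B, D)
        # Gate controls how much retrieved prior knowledge is mixed in.
        g = torch.sigmoid(
            self.gate(torch.cat([sentence_repr, retrieved], dim=-1))
        )
        return g * retrieved + (1 - g) * sentence_repr


# Example usage with random inputs:
mem = DynamicWorkingMemory()
enriched = mem(torch.randn(8, 512))  # -> shape (8, 512)
```

Because the memory slots are parameters trained end to end, they can absorb regularities shared across training sentences, which is one plausible way a working memory can supply the prior knowledge the abstract refers to.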