While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires not only recognition-level but also cognition-level comprehension of an image, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module that fuses information from images and text collectively. Second, a novel inference module is designed to encode commonsense among the image, query, and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
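To make the fusion step concrete, the following is a minimal illustrative sketch of cross-attention image-text fusion, assuming image regions and text tokens have already been embedded into a shared hidden size; module and variable names here are hypothetical and do not correspond to the released code.

```python
# Illustrative sketch only: text tokens attend over image region features via
# cross-attention, then a residual connection and layer norm produce the fused
# representation. All names and dimensions are assumptions for this example.
import torch
import torch.nn as nn


class ImageTextFusion(nn.Module):
    def __init__(self, hidden_size: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, num_tokens,  hidden_size)
        # image_feats: (batch, num_regions, hidden_size)
        fused, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)  # residual connection


# Usage with random features standing in for real embeddings.
text = torch.randn(2, 16, 512)    # e.g. query/response tokens
image = torch.randn(2, 36, 512)   # e.g. detected object regions
out = ImageTextFusion()(text, image)  # -> (2, 16, 512)
```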