Existing visual question answering methods tend to capture cross-modal spurious correlations and fail to discover the true causal mechanisms that enable faithful reasoning based on the dominant visual evidence and the question intention. Moreover, existing methods usually ignore cross-modal event-level understanding, which requires jointly modeling event temporality, causality, and dynamics. In this work, we approach event-level visual question answering from a new perspective, i.e., cross-modal causal relational reasoning, by introducing causal intervention methods to discover the true causal structures of the visual and linguistic modalities. Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR) to achieve robust causality-aware visual-linguistic question answering. To discover cross-modal causal structures, we propose a Causality-aware Visual-Linguistic Reasoning (CVLR) module that collaboratively disentangles visual and linguistic spurious correlations via front-door and back-door causal interventions. To model the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a Spatial-Temporal Transformer (STT) that captures multi-modal co-occurrence interactions between visual and linguistic content. To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that leverages hierarchical linguistic semantic relations as guidance to adaptively learn global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
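For readers unfamiliar with causal interventions, the back-door and front-door adjustments referenced above follow Pearl's standard formulations. The sketch below uses generic variables $X$ (input), $Y$ (answer), $Z$ (confounder), and $M$ (mediator), which are illustrative rather than the exact notation of the CVLR module:
\[
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z) \quad \text{(back-door adjustment)}
\]
\[
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x') \quad \text{(front-door adjustment)}
\]
The back-door form removes confounding when the confounder $Z$ is observable, whereas the front-door form handles unobserved confounders through an observable mediator $M$.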