Existing visual question answering methods tend to capture cross-modal spurious correlations and fail to discover the true causal mechanisms that facilitate reasoning based on the dominant visual evidence and the question intention. Moreover, existing methods usually ignore cross-modal event-level understanding, which requires jointly modeling event temporality, causality, and dynamics. In this work, we focus on event-level visual question answering from a new perspective, i.e., cross-modal causal relational reasoning, by introducing causal intervention methods to discover the true causal structures underlying the visual and linguistic modalities. Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR) to achieve robust causality-aware visual-linguistic question answering. To discover cross-modal causal structures, we propose a Causality-aware Visual-Linguistic Reasoning (CVLR) module that collaboratively disentangles visual and linguistic spurious correlations via front-door and back-door causal interventions. To model the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a Spatial-Temporal Transformer (STT) that captures multi-modal co-occurrence interactions between visual and linguistic content. To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that leverages hierarchical linguistic semantic relations as guidance to adaptively learn global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
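For readers unfamiliar with the causal intervention tools that the CVLR module builds on, the following are the standard back-door and front-door adjustment formulas from causal inference; the variables $X$ (input), $Y$ (answer), $Z$ (observed confounder), and $M$ (mediator) are generic placeholders rather than the paper's specific notation. The back-door adjustment applies when a set of observed confounders $Z$ blocks all back-door paths from $X$ to $Y$:
\begin{equation}
  P\big(Y \mid do(X)\big) = \sum_{z} P\big(Y \mid X, Z = z\big)\, P\big(Z = z\big).
\end{equation}
The front-door adjustment remains identifiable even when the confounder is unobserved, provided a mediator $M$ fully transmits the effect of $X$ on $Y$:
\begin{equation}
  P\big(Y \mid do(X)\big) = \sum_{m} P\big(M = m \mid X\big) \sum_{x'} P\big(Y \mid M = m, X = x'\big)\, P\big(X = x'\big).
\end{equation}
Intuitively, these adjustments replace the conditional distribution learned from biased observational data with an interventional one, which is what allows the model to suppress spurious visual-linguistic correlations.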