Existing visual question reasoning methods usually fail to explicitly discover the inherent causal mechanisms and ignore the complex event-level understanding that requires jointly modeling cross-modal event temporality and causality. In this paper, we propose an event-level visual question reasoning framework named Cross-Modal Question Reasoning (CMQR) to explicitly discover temporal causal structure and mitigate visual spurious correlations through causal intervention. To explicitly discover the visual causal structure, we propose the Visual Causality Discovery (VCD) architecture, which temporally locates question-critical scenes and disentangles visual spurious correlations with an attention-based front-door causal intervention module, the Local-Global Causal Attention Module (LGCAM). To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal co-occurrence interactions between visual and linguistic content. Extensive experiments on four datasets demonstrate the superiority of CMQR in discovering visual causal structures and achieving robust question reasoning.
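To make the front-door intervention idea concrete, below is a minimal PyTorch sketch of how an attention-based module in the spirit of LGCAM might combine local (question-critical) and global visual features. The class name, projection layout, dictionary size, and fusion step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class LocalGlobalCausalAttention(nn.Module):
    """Sketch of an attention-based front-door intervention (assumed design).

    Front-door adjustment P(A | do(V)) is approximated by letting local,
    question-critical features attend over a dictionary of global visual
    features, so the answer depends on the attended mediator rather than
    on spurious visual-answer shortcuts. Shapes and names are illustrative.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects local features
        self.key = nn.Linear(dim, dim)     # projects global features
        self.value = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        # f_local:  (B, T, D) question-critical clip features
        # f_global: (B, N, D) global visual "dictionary" features
        q = self.query(f_local)
        k = self.key(f_global)
        v = self.value(f_global)
        # Scaled dot-product attention over the global dictionary,
        # i.e., an expectation over global features (the intervention).
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        f_causal = attn @ v                # (B, T, D)
        # Fuse local evidence with the intervened representation.
        return self.fuse(torch.cat([f_local, f_causal], dim=-1))


if __name__ == "__main__":
    lgcam = LocalGlobalCausalAttention(dim=256)
    local = torch.randn(2, 16, 256)    # e.g., 16 sampled frames
    global_ = torch.randn(2, 64, 256)  # e.g., 64 global dictionary entries
    print(lgcam(local, global_).shape)  # torch.Size([2, 16, 256])
```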
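Similarly, the IVLT's cross-modal alignment can be pictured as interleaved self- and cross-attention between visual and linguistic token streams. The sketch below is a plausible single block under that assumption; the layer arrangement and normalization placement are hypothetical rather than the paper's architecture.

```python
import torch
import torch.nn as nn


class InteractiveVisualLinguisticBlock(nn.Module):
    """Sketch of one cross-modal co-attention block (assumed design):
    self-attention within each modality, then cross-attention in which
    each modality attends to the other to capture co-occurrences."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, T, D) spatial-temporal visual tokens
        # txt: (B, L, D) question word embeddings
        vis = self.norm_v(vis + self.vis_self(vis, vis, vis)[0])
        txt = self.norm_t(txt + self.txt_self(txt, txt, txt)[0])
        # Cross-modal interaction: visual queries attend to text, and
        # text queries attend to the updated visual tokens.
        vis = vis + self.txt2vis(vis, txt, txt)[0]
        txt = txt + self.vis2txt(txt, vis, vis)[0]
        return vis, txt


if __name__ == "__main__":
    block = InteractiveVisualLinguisticBlock(dim=256)
    v, t = block(torch.randn(2, 16, 256), torch.randn(2, 20, 256))
    print(v.shape, t.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 20, 256])
```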