Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignment between the visual scenes in the video and the linguistic semantics of the question in order to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers and treats them as the alignment. However, ERM can be problematic because it tends to over-exploit spurious correlations between question-irrelevant scenes and answers instead of inspecting the causal effect of question-critical scenes. As a result, VideoQA models suffer from unreliable reasoning. In this work, we first take a causal look at VideoQA and argue that invariant grounding is the key to ruling out spurious correlations. To this end, we propose a new learning framework, Invariant Grounding for VideoQA (IGV), which grounds the question-critical scene, whose causal relation with the answer is invariant across different interventions on its complement. With IGV, VideoQA models are forced to shield the answering process from the negative influence of spurious correlations, which significantly improves their reasoning ability. Experiments on three benchmark datasets validate the superiority of IGV over the leading baselines in terms of accuracy, visual explainability, and generalization ability.
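The idea of a question-critical scene whose relation to the answer is invariant under interventions on its complement can be illustrated with a toy sketch. This is not the IGV implementation: the abstract does not say how the scene is located or how interventions are carried out, so everything below is an assumption made for illustration. Clips are plain strings, the grounding scores are set by hand rather than learned, an intervention swaps each complement clip for one drawn from a hypothetical pool of clips from other videos, and the names `split_scene`, `intervene`, `is_invariant`, `shortcut`, and `grounded` are all hypothetical.

```python
import random
from collections import Counter


def split_scene(clips, scores, k):
    """Split a video into the k highest-scoring clips (treated here as the
    question-critical scene) and the remaining clips (the complement).
    Both parts are returned as lists of (position, clip) pairs."""
    top = set(sorted(range(len(clips)), key=lambda i: scores[i], reverse=True)[:k])
    causal = [(i, c) for i, c in enumerate(clips) if i in top]
    complement = [(i, c) for i, c in enumerate(clips) if i not in top]
    return causal, complement


def intervene(causal, complement, pool, rng):
    """Rebuild the video with every complement clip replaced by a clip drawn
    from `pool`; the causal clips stay at their original positions."""
    video = dict(causal)
    for position, _ in complement:
        video[position] = rng.choice(pool)
    return [video[i] for i in sorted(video)]


def is_invariant(predict, clips, scores, k, pool, trials=20, seed=0):
    """Check whether `predict` gives the same answer on the original video
    and on `trials` randomly intervened versions of it."""
    rng = random.Random(seed)
    causal, complement = split_scene(clips, scores, k)
    baseline = predict(clips)
    return all(
        predict(intervene(causal, complement, pool, rng)) == baseline
        for _ in range(trials)
    )


def shortcut(video):
    """Toy predictor that follows the dominant clip, i.e. the background."""
    return Counter(video).most_common(1)[0][0]


def grounded(video):
    """Toy predictor that depends only on the question-critical clip."""
    return "dog_jumps" if "dog_jumps" in video else "unknown"


# Hand-made example: one question-critical clip among background clips.
clips = ["beach", "dog_jumps", "beach", "beach"]
scores = [0.1, 0.9, 0.2, 0.1]            # assumed grounding scores, not learned
pool = ["street", "kitchen", "office"]   # assumed clips from other videos

shortcut_ok = is_invariant(shortcut, clips, scores, 1, pool)
grounded_ok = is_invariant(grounded, clips, scores, 1, pool)
```

In this toy setup the background-driven predictor changes its answer as soon as the complement is replaced, while the predictor that relies only on the grounded clip does not. That difference is the property the abstract uses to separate causal alignment from spurious correlation.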