Equivariant and Invariant Grounding for Video Question Answering

Video Question Answering (VideoQA) is the task of answering natural-language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which obscures the visual-linguistic alignment behind the answering process. This black-box nature calls for visual explainability that reveals "What part of the video should the model look at to answer the question?". Only a few works present visual explanations, and they do so in a post-hoc fashion, emulating the target model's answering process via an additional method. Such emulation, however, struggles to faithfully exhibit the visual-linguistic alignment used during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene that yields the answer, while ruling out the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, equivariant grounding encourages the answering to be sensitive to semantic changes in the causal scene and the question; in contrast, invariant grounding enforces the answering to be insensitive to changes in the environment scene. By imposing both on the answering process, EIGV is able to distinguish the causal scene from the environment information and to explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets demonstrate the superiority of EIGV over the leading baselines in terms of accuracy and visual interpretability.
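The two properties named in the abstract can be illustrated with a toy check. The sketch below is not the EIGV model or its training objective; it is an idealized stand-in, assuming each clip carries a hypothetical score vector over two candidate answers and that a grounding mask marking the causal clips is already given. The names `predict`, `swap`, `video`, `donor`, and `mask` are all illustrative assumptions. It shows only what the invariance and equivariance properties would mean for a perfectly grounded answerer.

```python
from typing import List, Tuple

# Hypothetical representation: each clip holds scores over candidate answers.
Clip = Tuple[float, ...]


def predict(video: List[Clip], mask: List[bool]) -> int:
    """Toy answerer: sum the answer scores of clips marked causal, return the argmax."""
    causal = [clip for clip, keep in zip(video, mask) if keep]
    totals = [sum(scores) for scores in zip(*causal)]
    return totals.index(max(totals))


def swap(video: List[Clip], mask: List[bool], donor: List[Clip],
         replace_causal: bool) -> List[Clip]:
    """Replace either the causal clips or the environment clips with donor clips."""
    return [d if keep == replace_causal else c
            for c, d, keep in zip(video, donor, mask)]


# Assumed toy data: clips 0 and 2 are causal, clips 1 and 3 are environment.
video = [(0.9, 0.1), (0.2, 0.3), (0.8, 0.2), (0.1, 0.4)]
donor = [(0.1, 0.9), (0.5, 0.5), (0.2, 0.7), (0.6, 0.1)]
mask = [True, False, True, False]

original = predict(video, mask)
# Invariance: changing only the environment scene should leave the answer unchanged.
env_swapped = predict(swap(video, mask, donor, replace_causal=False), mask)
# Equivariance: changing the causal scene should change the answer accordingly.
causal_swapped = predict(swap(video, mask, donor, replace_causal=True), mask)

print(original, env_swapped, causal_swapped)
```

In this toy setting the mask is fixed by hand, so invariance holds by construction; in the framework described above, the grounding itself is what the model has to learn.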
