In this paper, we propose SHRIKE, a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering. The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues in complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network~(KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, allowing the model to capture richer and more nuanced patterns and thereby improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
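To make the KAN-based MoE component concrete, the following is a minimal PyTorch sketch of one plausible realization: KAN-style experts whose edges apply learnable univariate functions (here parameterized with Gaussian radial basis functions plus a linear residual path), combined by a soft gating network over the question-aware fused audio-visual features. The class names (`KANExpert`, `KANMoE`), the RBF basis, the number of experts, and the feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KANExpert(nn.Module):
    """Simplified KAN layer: each edge applies a learnable univariate function,
    parameterized as a weighted sum of Gaussian radial basis functions."""

    def __init__(self, dim_in: int, dim_out: int, num_basis: int = 8):
        super().__init__()
        # Fixed RBF centers on [-1, 1]; coefficients learned per (out, in, basis).
        self.register_buffer("centers", torch.linspace(-1.0, 1.0, num_basis))
        self.coeff = nn.Parameter(torch.randn(dim_out, dim_in, num_basis) * 0.02)
        self.base = nn.Linear(dim_in, dim_out)  # linear residual path, common in KAN variants

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim_in) question-aware fused audio-visual features
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.5)  # (b, t, in, basis)
        spline = torch.einsum("btik,oik->bto", phi, self.coeff)
        return self.base(x) + spline


class KANMoE(nn.Module):
    """Soft mixture of KAN experts: a gating network weights each expert's output."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(KANExpert(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(x), dim=-1)                    # (b, t, experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, t, dim, experts)
        return torch.einsum("btde,bte->btd", outputs, weights)


if __name__ == "__main__":
    fused = torch.randn(2, 10, 512)   # (batch, frames, feature dim) placeholder features
    print(KANMoE(512)(fused).shape)   # torch.Size([2, 10, 512])
```

The soft, densely weighted gating shown here is only one design choice; a sparse top-k router or a different univariate basis (e.g., B-splines) would fit the same interface.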