To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior in computation efficiency and interpretability.
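To make the cascaded selection-and-attention idea concrete, below is a minimal PyTorch-style sketch of one such layer. The names (`MISTLayer`, `top_k_segments`, `top_k_regions`) and the hard top-k selection are assumptions for illustration only; the paper's modules perform adaptive (differentiable) selection, and this sketch is not the authors' implementation.

```python
import torch
import torch.nn as nn


class MISTLayer(nn.Module):
    """One question-conditioned segment/region selection + attention layer (illustrative sketch)."""

    def __init__(self, dim=512, heads=8, top_k_segments=2, top_k_regions=4):
        super().__init__()
        self.top_k_segments = top_k_segments
        self.top_k_regions = top_k_regions
        self.seg_proj = nn.Linear(dim, dim)   # scores segments against the question
        self.reg_proj = nn.Linear(dim, dim)   # scores regions against the question
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches, question):
        # patches:  (B, S, L, R, D) — S segments, L frames per segment, R regions per frame
        # question: (B, D) pooled question embedding
        B, S, L, R, D = patches.shape
        q = question.unsqueeze(1)                                     # (B, 1, D)

        # Segment selection: keep the segments most relevant to the question.
        seg_feat = patches.mean(dim=(2, 3))                           # (B, S, D)
        seg_score = (self.seg_proj(seg_feat) * q).sum(-1) / D ** 0.5  # (B, S)
        seg_idx = seg_score.topk(self.top_k_segments, dim=1).indices  # (B, k)
        sel = torch.gather(
            patches, 1,
            seg_idx[:, :, None, None, None].expand(-1, -1, L, R, D))  # (B, k, L, R, D)

        # Region selection inside the kept segments.
        reg = sel.reshape(B, -1, R, D)                                # (B, k*L, R, D)
        reg_score = (self.reg_proj(reg) * q[:, None]).sum(-1)         # (B, k*L, R)
        reg_idx = reg_score.topk(self.top_k_regions, dim=-1).indices  # (B, k*L, j)
        reg_sel = torch.gather(
            reg, 2, reg_idx[..., None].expand(-1, -1, -1, D))         # (B, k*L, j, D)

        # Attention over the question token, coarse frame-level tokens,
        # and fine region-level tokens (multi-granularity fusion).
        frame_tok = patches.mean(dim=3).reshape(B, S * L, D)          # (B, S*L, D)
        tokens = torch.cat([q, frame_tok, reg_sel.flatten(1, 2)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused[:, 0]                                            # updated question/context token


# Usage: 8 segments of 4 frames with 16 regions each, 512-d features.
layer = MISTLayer(dim=512)
video = torch.randn(2, 8, 4, 16, 512)
q = torch.randn(2, 512)
out = layer(video, q)  # (2, 512)
```

Stacking several such layers, each re-scoring segments and regions against the updated question/context representation, corresponds to the iterative selection and attention over multiple events described above.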