Video understanding represents one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols, while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training