Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

From arXiv, version 1

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
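
To make "RL from verifiable objectives" concrete for temporal localization, here is a minimal sketch of one kind of reward that can be checked automatically: the temporal intersection-over-union (IoU) between a predicted time span and an annotated one. This is an illustration only, not a method taken from the survey. The function name `temporal_iou` and the `(start_sec, end_sec)` span format are assumptions chosen for this example.

```python
from typing import Tuple

# Assumed span format for this sketch: (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Return the temporal IoU of two time spans, a value in [0, 1].

    Hypothetical helper: a score like this can act as a verifiable
    reward, because it is computed directly from annotations rather
    than from a learned judge.
    """
    pred_start, pred_end = pred
    gold_start, gold_end = gold

    # Malformed (empty or inverted) spans earn no reward.
    if pred_end <= pred_start or gold_end <= gold_start:
        return 0.0

    # Overlap is clipped at zero when the spans are disjoint.
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union


if __name__ == "__main__":
    # Predicted span 2s-6s against an annotated span 4s-8s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

In a training loop, a score of this kind would typically be combined with other checks, such as answer correctness or output format, but how those terms are weighted is a design decision that this sketch does not cover.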

