By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most adopt static, non-learnable tool invocation mechanisms, which limit the discovery of the diverse clues essential for robust perception and reasoning over temporally or spatially complex videos. To address this challenge, we propose VideoChat-M1, a novel multi-agent system for video understanding. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: each agent generates its own tool invocation policy tailored to the user's query; (2) Policy Execution: each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: at intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively answer the user's query. Moreover, we equip the CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method, so that the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves state-of-the-art (SOTA) performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6%.
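To make the three CPP stages concrete, the following is a minimal sketch of how policy generation, execution, and communication could interleave across agents. All names here (PolicyAgent, invoke_tool, cpp_answer, aggregate_answers) and the user-supplied MLLM callable are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal sketch of the Collaborative Policy Planning (CPP) loop.
# PolicyAgent, invoke_tool, cpp_answer, and aggregate_answers are hypothetical names.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PolicyAgent:
    name: str
    llm: Callable[[str], str]                           # tool-augmented MLLM backing this agent
    policy: List[str] = field(default_factory=list)     # ordered tool-invocation plan
    findings: List[str] = field(default_factory=list)   # clues gathered so far

    def generate_policy(self, query: str) -> None:
        """(1) Policy Generation: draft a tool-invocation plan for the query."""
        plan = self.llm(f"Plan tool calls for query: {query}")
        self.policy = plan.splitlines()

    def execute_step(self, video: str, step: int) -> None:
        """(2) Policy Execution: invoke the next tool in the plan on the video."""
        if step < len(self.policy):
            self.findings.append(invoke_tool(self.policy[step], video))

    def communicate(self, peer_findings: List[str], query: str) -> None:
        """(3) Policy Communication: revise the remaining plan given peers' clues."""
        update = self.llm(
            f"Query: {query}\nPeer clues: {peer_findings}\nRevise remaining plan."
        )
        self.policy = update.splitlines()


def invoke_tool(tool_call: str, video: str) -> str:
    """Placeholder for calling a perception tool (e.g., a captioner or temporal grounder)."""
    return f"result of {tool_call} on {video}"


def aggregate_answers(all_findings: List[List[str]], query: str) -> str:
    """Hypothetical aggregation of all agents' clues into a final answer."""
    return f"answer to '{query}' from {sum(len(f) for f in all_findings)} clues"


def cpp_answer(agents: List[PolicyAgent], video: str, query: str, rounds: int = 3) -> str:
    """Run generation, interleaved execution, and communication, then aggregate."""
    for agent in agents:
        agent.generate_policy(query)
    for step in range(rounds):
        for agent in agents:
            agent.execute_step(video, step)
        # Intermediate communication: each agent sees the other agents' findings.
        for agent in agents:
            peers = [f for a in agents if a is not agent for f in a.findings]
            agent.communicate(peers, query)
    return aggregate_answers([a.findings for a in agents], query)
```

In this sketch, the answer reward and intermediate collaboration feedback mentioned in the abstract would serve as the MARL training signals for the agents' underlying policies; the loop above only illustrates inference-time collaboration.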