Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought, yet they remain vulnerable to hallucinations, especially on long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video-cropping tool that zooms in on a specific video clip and resamples finer-grained video frames. This global-to-local reasoning loop continues until the answer is grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
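To make the global-to-local reasoning loop concrete, the following is a minimal sketch in Python of how an interleaved tool-calling loop of this kind could look. It is illustrative only and not the LongVT implementation: the names `answer_with_long_video`, `sample_frames`, `Step`, `MAX_TURNS`, and the `lmm.generate` interface (which is assumed to return a `Step` with an optional temporal-grounding tool call) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Step:
    text: str                     # one textual reasoning step from the model
    tool_call: dict | None = None  # e.g. {"start_sec": 120.0, "end_sec": 180.0}

def sample_frames(video, fps, start=0.0, end=None):
    """Placeholder sampler: returns frame timestamps instead of decoded pixels."""
    end = video["duration_sec"] if end is None else end
    n = max(1, int((end - start) * fps))
    return [start + i / fps for i in range(n)]

MAX_TURNS = 8  # hypothetical cap on zoom-in rounds

def answer_with_long_video(lmm, video, question, fps_coarse=0.5, fps_fine=2.0):
    # 1. Global skim: sparsely sampled frames over the whole video plus the question.
    context = [{"role": "user",
                "frames": sample_frames(video, fps_coarse),
                "text": question}]
    for _ in range(MAX_TURNS):
        step = lmm.generate(context)
        if step.tool_call is None:
            # Answer is already grounded in the evidence gathered so far; stop.
            return step.text
        # 2. Local zoom-in: crop the temporally grounded span and resample it
        #    at a finer frame rate, then feed the clip back as tool output.
        s, e = step.tool_call["start_sec"], step.tool_call["end_sec"]
        context += [{"role": "assistant", "text": step.text},
                    {"role": "tool", "frames": sample_frames(video, fps_fine, s, e)}]
    # Turn budget exhausted: request a final answer from the accumulated context.
    return lmm.generate(context + [{"role": "user", "text": "Answer now."}]).text
```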