Long-form video understanding remains challenging due to extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or on token-consuming video preprocessing to guide MLLMs toward autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction: it runs a continuous loop of observing, thinking, acting, and memorizing, in which a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the agent's operation, providing precise contextual information to support the controller's decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method DVD while significantly reducing token consumption on long-form videos.
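The abstract describes an observe-think-act-memorize loop over a hierarchical multimodal memory. Below is a minimal sketch of how such a loop might be organized; the Controller, tool, and HierarchicalMemory interfaces here are illustrative assumptions, not the paper's actual API.

```python
# Sketch of an agentic observe-think-act-memorize loop (hypothetical interfaces,
# not VideoARM's actual implementation).

from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Stores multi-level clues, from coarse clip summaries down to fine frame details."""
    levels: dict = field(default_factory=lambda: {"clip": [], "segment": [], "frame": []})

    def update(self, level: str, clue: str) -> None:
        self.levels[level].append(clue)

    def context(self) -> str:
        # Concatenate clues from coarse to fine as context for the controller.
        return "\n".join(c for lvl in ("clip", "segment", "frame") for c in self.levels[lvl])


def agentic_reasoning(question: str, video, controller, tools, max_steps: int = 10) -> str:
    """Coarse-to-fine loop: observe -> think -> act -> memorize until an answer is produced."""
    memory = HierarchicalMemory()
    for _ in range(max_steps):
        # Think: the controller decides the next action given the question and current memory.
        decision = controller.decide(question=question, context=memory.context())
        if decision.action == "answer":
            return decision.content
        # Act: invoke the chosen tool (e.g., a clip captioner or frame reader) on the video.
        observation = tools[decision.action](video, **decision.args)
        # Memorize: store the new clue at the appropriate granularity level.
        memory.update(decision.level, observation)
    # Fall back to answering with whatever clues have been gathered so far.
    return controller.decide(question=question, context=memory.context(), force_answer=True).content
```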