Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
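To make the plan-observe-reflect loop concrete, the sketch below outlines how such an iterative evidence-seeking process could be organized in Python. This is a minimal, hypothetical illustration under our own assumptions: the `planner`, `observer`, and `reflector` interfaces, the `Evidence` containers, and the round budget are placeholders, not AVP's actual implementation.

```python
# Minimal sketch of an iterative plan-observe-reflect loop for active video
# perception. All names below are hypothetical placeholders, not the paper's
# actual implementation.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    timestamp: float   # time (in seconds) the observation refers to
    description: str   # extracted, query-relevant content

@dataclass
class Verdict:
    sufficient: bool   # is the gathered evidence enough to answer the query?
    answer: str        # best available answer given the current evidence

@dataclass
class EvidencePool:
    items: list[Evidence] = field(default_factory=list)

    def add(self, new_items: list[Evidence]) -> None:
        self.items.extend(new_items)

def answer_query(video, query, planner, observer, reflector, max_rounds=5):
    """Iterate plan -> observe -> reflect until the evidence is sufficient."""
    pool = EvidencePool()
    for _ in range(max_rounds):
        # Planner: decide what, when, and where to look, conditioned on the
        # query and the evidence gathered so far.
        actions = planner.propose(query, pool)

        # Observer: execute the targeted video interactions and return
        # time-stamped, query-relevant evidence extracted from pixels.
        pool.add(observer.execute(video, actions))

        # Reflector: judge whether the evidence suffices; halt with an
        # answer or trigger another round of observation.
        verdict = reflector.assess(query, pool)
        if verdict.sufficient:
            return verdict.answer

    # Fall back to the best available answer once the round budget is spent.
    return reflector.assess(query, pool).answer
```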