Long video understanding remains challenging because video content is complex, diverse, and temporally scattered. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
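To make the consensus frame-selection idea concrete, the following is a minimal Python sketch, not the paper's method: it assumes per-frame semantic scores (e.g., produced by an LLM reasoning over captions) and frame embeddings are precomputed, and it fuses the two branches with a simple convex combination as an illustrative stand-in for the mutual-information-based alignment described in the abstract. All names (`select_frames`, `alpha`, `n_clusters`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_frames(semantic_scores: np.ndarray,
                  frame_embeddings: np.ndarray,
                  k: int,
                  n_clusters: int = 8,
                  alpha: float = 0.5) -> np.ndarray:
    """Pick k frame indices by fusing a caption-derived semantic score
    with a cluster-guided visual score.

    Illustrative sketch only: the convex combination below stands in for
    the paper's mutual-information alignment, which is not reproduced here.
    """
    # Visual branch: cluster frame embeddings, then score each frame by the
    # mean semantic score of its cluster, so visually similar frames share
    # evidence with semantically relevant ones.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frame_embeddings)
    cluster_score = np.array([semantic_scores[labels == c].mean()
                              for c in range(n_clusters)])
    visual_scores = cluster_score[labels]

    # Consensus: fuse the semantic and visual branches (alpha is an assumed
    # weighting, not a parameter from the paper).
    fused = alpha * semantic_scores + (1 - alpha) * visual_scores

    # Return the indices of the k highest-scoring frames, best first.
    return np.argsort(fused)[-k:][::-1]


# Example usage: 500 candidate frames with 512-dim embeddings, keep 32.
rng = np.random.default_rng(0)
picked = select_frames(rng.random(500), rng.normal(size=(500, 512)), k=32)
```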