While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. Yet this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to analyze and integrate spatiotemporal patterns across dynamic visual streams. We conduct an extensive evaluation of 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, and Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models such as GPT-4o achieve only 63.5% accuracy on causal reasoning tasks, compared to 91.3% for human performance. Crucially, our analysis exposes fundamental bottlenecks in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for advancing pattern recognition methodologies in multi-video scenarios and provides architectural insights for next-generation models. The data and evaluation code are available at: https://github.com/Hokhim2/CVBench.