Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, the causal two-needle question, requires extracting information from both the cause event and the effect event in a long video and its associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip that contains the answer, and verbally describing a visual detail from that video clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal two-needle questions, and that model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs. The dataset is available at: https://huggingface.co/datasets/causal2needles/Causal2Needles
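The reported negative correlation between performance and needle distance can be illustrated with a minimal sketch. It assumes each causal two-needle question is summarized by a needle distance (for example, the number of clips between the cause and effect events) and a binary correctness flag. The helper `pearson` and the toy values in `distances` and `correct` are hypothetical and for illustration only; they are not the benchmark's data or the authors' evaluation code.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # Population covariance divided by the product of population std devs.
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


# Hypothetical toy data: distance between the two needles (in clips)
# and whether the model answered that question correctly (1) or not (0).
distances = [1, 2, 4, 8, 16, 32]
correct = [1, 1, 1, 0, 1, 0]

r = pearson(distances, correct)
print(f"correlation between needle distance and correctness: {r:.3f}")
```

With a binary correctness flag, this Pearson coefficient is the point-biserial correlation; a negative value means accuracy tends to drop as the two needles move farther apart.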