The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to those of large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify their strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, laying a foundation for future research in this emerging field.