Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, in which models resolve complex tasks by generating a continuous sequence of frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that uses a VLM-as-Judge with a hierarchical rubric to evaluate both the validity of intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
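The abstract does not formally define POC@r. As a minimal sketch, assuming $r$ denotes a threshold on the rubric-based process score assigned by the VLM judge, the metric could take the following form over $N$ evaluated generations:

\[
\mathrm{POC@}r \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\, o_i = 1 \;\wedge\; s_i \ge r \,\right],
\]

where $o_i \in \{0,1\}$ indicates whether the final outcome of the $i$-th generation is correct and $s_i \in [0,1]$ is its hierarchical-rubric process score. Under this reading, POC@1.0 credits a sample only when the outcome is correct and every intermediate step is judged valid, which is what separates it from outcome-only accuracy; the exact formulation in the paper may differ.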