This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physical plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physical plausibility of state-of-the-art video generative models by ~6%.
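One common way to use a frozen world model as a reward signal is best-of-N reranking: sample several candidate videos from the generator and keep the one the reward model scores as most plausible. The sketch below illustrates that control flow only; the generator, the reward function, and all names here are hypothetical stand-ins, not the actual MAGI-1 or VJEPA-2 interfaces (in the real setup the reward would compare VJEPA-2's predicted future embeddings against embeddings of the generated frames).

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidates(prompt, n):
    # Hypothetical stand-in for a video generative model (e.g. MAGI-1):
    # returns n candidate "videos" as random (T, H, W, C) arrays.
    return [rng.standard_normal((8, 16, 16, 3)) for _ in range(n)]

def reward(video):
    # Hypothetical stand-in for a VJEPA-2-style plausibility score.
    # Here we simply penalize large frame-to-frame changes; the real
    # reward would measure agreement with the world model's predictions.
    diffs = np.diff(video, axis=0)
    return -float(np.mean(diffs ** 2))

def best_of_n(prompt, n=4):
    # Reward-guided sampling: draw n candidates from the generator,
    # score each with the frozen reward model, return the best one.
    candidates = generate_candidates(prompt, n)
    scores = [reward(v) for v in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

video, score = best_of_n("ball rolling off a table", n=4)
print(video.shape)
```

This is only a sketch of the sampling loop under the stated assumptions; guidance can also be applied more tightly, e.g. rescoring at intermediate denoising steps rather than only on finished videos.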