Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts, which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts enable videos to respond realistically to physical control signals by leveraging the visual and motion priors in the original pretrained model, without using any 3D assets or physics simulators at inference. The primary challenge of force prompting is obtaining high-quality paired force-video training data: force signals are hard to measure in the real world, and synthetic data is limited by the visual quality and domain diversity of physics simulators. Our key finding is that video generation models generalize remarkably well when adapted to follow physical force conditioning from videos synthesized in Blender, even with limited demonstrations of only a few objects. Our method can generate videos that simulate forces across diverse geometries, settings, and materials. We also investigate the source of this generalization, performing ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k examples for a single day on four A100 GPUs, yet outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
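To make the two force-prompt types above concrete, the following Python sketch shows one plausible way the control signals could be parameterized. The class and field names, coordinate conventions, and value ranges are illustrative assumptions only; the abstract does not specify the actual interface.

```python
from dataclasses import dataclass

# Hypothetical parameterizations of the two force-prompt types described in the
# abstract; names and ranges are illustrative assumptions, not the paper's API.

@dataclass
class PointForce:
    """Localized force applied at a single pixel, e.g. poking a plant."""
    x: float          # horizontal pixel coordinate of the poke
    y: float          # vertical pixel coordinate of the poke
    angle_deg: float  # direction of the applied force in image space
    magnitude: float  # normalized force strength, e.g. in [0, 1]

@dataclass
class WindForce:
    """Global wind field applied uniformly across the scene."""
    angle_deg: float  # wind direction in image space
    magnitude: float  # normalized wind strength, e.g. in [0, 1]

# Example prompts: poking near the image center, and a gentle breeze blowing left.
poke = PointForce(x=256.0, y=192.0, angle_deg=30.0, magnitude=0.7)
breeze = WindForce(angle_deg=180.0, magnitude=0.3)
```

In practice, such a signal would still need to be encoded into whatever conditioning format the pretrained video model accepts alongside the input image and text prompt; that detail is left to the paper body.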