In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories to the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in the future and in counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves accuracy on predictive and counterfactual questions by 4.5% and 11.5%, respectively, compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.
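To make the idea of inferring a physical property by fitting simulated motion to observations concrete, the following is a minimal toy sketch, not the VRDP simulator. It assumes a one-dimensional collision between two bodies with known masses and pre-collision velocities, and recovers the restitution coefficient by gradient descent on the squared error between simulated and observed post-collision velocities. The function names (`collide_1d`, `loss_and_grad`, `fit_restitution`), the hand-derived gradient, and all numeric values are illustrative assumptions; a real differentiable engine would typically obtain gradients by automatic differentiation through the full simulation.

```python
def collide_1d(m1, m2, v1, v2, e):
    """Impulse-based 1D collision; returns the post-collision velocities."""
    u = v1 - v2                    # closing speed along the contact normal
    mu = m1 * m2 / (m1 + m2)       # reduced mass of the pair
    j = (1.0 + e) * mu * u         # impulse magnitude for restitution e
    return v1 - j / m1, v2 + j / m2


def loss_and_grad(e, m1, m2, v1, v2, obs1, obs2):
    """Squared error against observed velocities and its derivative w.r.t. e."""
    u = v1 - v2
    mu = m1 * m2 / (m1 + m2)
    p1, p2 = collide_1d(m1, m2, v1, v2, e)
    r1, r2 = p1 - obs1, p2 - obs2  # residuals for each body
    loss = r1 * r1 + r2 * r2
    # Chain rule: d(p1)/de = -mu*u/m1 and d(p2)/de = +mu*u/m2.
    grad = 2.0 * r1 * (-mu * u / m1) + 2.0 * r2 * (mu * u / m2)
    return loss, grad


def fit_restitution(m1, m2, v1, v2, obs1, obs2, e0=0.0, lr=0.05, steps=50):
    """Plain gradient descent on the restitution coefficient (toy settings)."""
    e = e0
    for _ in range(steps):
        _, g = loss_and_grad(e, m1, m2, v1, v2, obs1, obs2)
        e -= lr * g
    return e


# Synthetic "observation": generated with a hidden restitution of 0.6.
TRUE_E = 0.6
obs1, obs2 = collide_1d(1.0, 2.0, 3.0, 0.0, TRUE_E)

# Recover the coefficient from the observed post-collision velocities alone.
e_hat = fit_restitution(1.0, 2.0, 3.0, 0.0, obs1, obs2)
print(f"estimated restitution: {e_hat:.6f}")
```

In this toy setting the loss is quadratic in the restitution coefficient, so a fixed step size is enough; fitting full trajectories with several interacting objects would generally need a more careful optimizer and initialization.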