Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

In this work, we propose a unified framework, called Visual Reasoning with Differentiable Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects and their interactions from videos and language. This is achieved by integrating three components: a visual perception module, a concept learner, and a differentiable physics engine. The visual perception module parses each video frame into object-centric trajectories and represents them as latent scene representations. The concept learner grounds visual concepts (e.g., color, shape, and material) from these object-centric representations based on the language, thus providing prior knowledge for the physics engine. The differentiable physics model, implemented as an impulse-based differentiable rigid-body simulator, performs differentiable physical simulation based on the grounded concepts to infer physical properties, such as mass, restitution, and velocity, by fitting the simulated trajectories to the video observations. Consequently, these learned concepts and physical models can explain what we have seen and imagine what is about to happen in future and counterfactual scenarios. Integrating differentiable physics into the dynamic reasoning framework offers several appealing benefits. More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5%, respectively, compared to its best counterpart. VRDP is also highly data-efficient: physical parameters can be optimized from very few videos, and even a single video can be sufficient. Finally, with all physical parameters inferred, VRDP can quickly learn new concepts from a few examples.
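The idea of inferring a physical property by fitting simulated outcomes to observations can be illustrated with a toy example. The sketch below is not the VRDP implementation: it assumes a one-dimensional, frictionless collision between two point masses with known masses and pre-collision velocities, and it recovers only the restitution coefficient by gradient descent on a squared error. The function names `collide` and `fit_restitution`, the step size, and the iteration count are all hypothetical choices made for this illustration.

```python
def collide(m1, m2, v1, v2, e):
    """Impulse-based 1D collision: return the two post-collision velocities."""
    # Impulse along the contact normal for restitution coefficient e.
    j = (1.0 + e) * (v1 - v2) * m1 * m2 / (m1 + m2)
    return v1 - j / m1, v2 + j / m2


def fit_restitution(m1, m2, v1, v2, obs1, obs2, e0=0.5, lr=0.1, steps=200):
    """Estimate e by gradient descent on the squared velocity error."""
    e = e0
    # The impulse is linear in e, so its derivative with respect to e is constant.
    dj_de = (v1 - v2) * m1 * m2 / (m1 + m2)
    for _ in range(steps):
        p1, p2 = collide(m1, m2, v1, v2, e)
        # Analytic gradient of (p1 - obs1)^2 + (p2 - obs2)^2 with respect to e.
        grad = 2.0 * (p1 - obs1) * (-dj_de / m1) + 2.0 * (p2 - obs2) * (dj_de / m2)
        e -= lr * grad
        # Keep the estimate inside the physically meaningful range [0, 1].
        e = min(1.0, max(0.0, e))
    return e


# Synthetic "observation" generated with a known restitution of 0.8.
observed = collide(1.0, 2.0, 2.0, 0.0, 0.8)
estimate = fit_restitution(1.0, 2.0, 2.0, 0.0, observed[0], observed[1])
print(round(estimate, 4))
```

Here the gradient is written out by hand because the model is a single closed-form collision. A full differentiable rigid-body simulator would typically obtain gradients through many simulation steps with automatic differentiation, and the fixed step size used above is tuned only for this example.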

