We consider the problem of estimating an object's physical properties, such as mass, friction, and elasticity, directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current solutions require precise 3D labels, which are labor-intensive to gather and infeasible to create for many systems such as deformable solids or cloth. We present gradSim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This novel combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Moreover, our unified computation graph -- spanning both the dynamics and the rendering process -- enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision, while achieving performance competitive with or better than techniques that rely on precise 3D labels.
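The core idea of differentiating through a simulator to recover physical parameters can be illustrated with a toy sketch. This is not gradSim itself: it is a minimal, self-contained example that recovers an unknown drag coefficient from an observed 1-D trajectory by taking gradients through a hand-written Euler simulator. All names and values here are illustrative assumptions, and finite differences stand in for the automatic differentiation a real framework would use.

```python
# Toy system identification by gradient descent through a simulator.
# Hypothetical setup: a particle falls under gravity with linear drag
# coefficient c; we recover c from observed positions alone.

def simulate(c, steps=50, dt=0.05, g=-9.8):
    """Euler-integrate 1-D fall with linear drag; return positions."""
    x, v = 0.0, 0.0
    traj = []
    for _ in range(steps):
        v += dt * (g - c * v)
        x += dt * v
        traj.append(x)
    return traj

def loss(c, observed):
    """Mean squared error between simulated and observed positions."""
    sim = simulate(c)
    return sum((a - b) ** 2 for a, b in zip(sim, observed)) / len(sim)

# "Observed" trajectory generated with a hidden true coefficient.
true_c = 0.7
observed = simulate(true_c)

# Gradient descent on c. The gradient through the simulator is taken
# by central finite differences, standing in for autodiff/backprop.
c, lr, eps = 0.1, 0.002, 1e-5
for _ in range(2000):
    grad = (loss(c + eps, observed) - loss(c - eps, observed)) / (2 * eps)
    c -= lr * grad

print(round(c, 3))  # recovered coefficient, close to true_c = 0.7
```

gradSim extends this idea in two ways the toy example omits: the simulator is a full differentiable multiphysics engine, and the loss is measured in pixel space by also differentiating through a renderer, so no 3D state supervision is needed.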