Human motion synthesis is an important problem with applications in graphics, gaming, and simulation environments for robotics. Existing methods require accurate motion capture data for training, which is costly to obtain. Instead, we propose a framework for training generative models of physically plausible human motion directly from monocular RGB videos, which are much more widely available. At the core of our method is a novel optimization formulation that corrects imperfect image-based pose estimates by enforcing physics constraints and reasoning about contacts in a differentiable way. This optimization yields corrected 3D poses and motions, as well as their corresponding contact forces. Results show that our physically-corrected motions significantly outperform prior work on pose estimation. We can then use these corrected motions to train a generative model to synthesize future motion. We demonstrate, both qualitatively and quantitatively, improved motion estimation, synthesis quality, and physical plausibility achieved by our method on the Human3.6M dataset~\cite{h36m_pami} compared to prior kinematic and physics-based methods. By enabling learning of motion synthesis from video, our method paves the way for large-scale, realistic, and diverse motion synthesis. Project page: \url{https://nv-tlabs.github.io/publication/iccv_2021_physics/}
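To make the idea of physics-based correction concrete, the following is a minimal sketch of gradient-based trajectory correction with simple physics-inspired penalties. It is not the paper's implementation: the joint layout, foot-joint indices, contact threshold, and penalty weights are illustrative assumptions, and the paper's actual formulation additionally recovers contact forces through a differentiable contact model.

```python
# Hypothetical sketch: correct a noisy 3D pose trajectory by penalizing floor
# penetration, foot sliding during contact, and high acceleration, while staying
# close to the image-based kinematic estimate.
import torch

def correct_motion(kin_poses, foot_idx=(3, 7), floor_y=0.0,
                   steps=200, lr=1e-2, w_pen=10.0, w_slide=1.0, w_smooth=0.1):
    """kin_poses: (T, J, 3) noisy 3D joint positions from an image-based estimator."""
    poses = kin_poses.clone().requires_grad_(True)
    opt = torch.optim.Adam([poses], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        data = ((poses - kin_poses) ** 2).mean()               # stay near the estimate
        feet = poses[:, list(foot_idx), :]                     # (T, F, 3) foot joints
        pen = torch.relu(floor_y - feet[..., 1]).mean()        # no floor penetration
        contact = (feet[..., 1] < floor_y + 0.02).float()      # crude contact detection
        vel = feet[1:] - feet[:-1]
        slide = (contact[1:, :, None] * vel).abs().mean()      # no sliding in contact
        smooth = ((poses[2:] - 2 * poses[1:-1] + poses[:-2]) ** 2).mean()  # low accel.
        loss = data + w_pen * pen + w_slide * slide + w_smooth * smooth
        loss.backward()
        opt.step()
    return poses.detach()
```

Because every penalty is differentiable in the pose trajectory, a standard first-order optimizer suffices; this mirrors, at a toy scale, why differentiable contact reasoning makes the correction problem tractable with gradient-based methods.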