We focus on the task of estimating a physically plausible articulated human motion from monocular video. Existing approaches that do not consider physics often produce temporally inconsistent output with motion artifacts, while state-of-the-art physics-based approaches have either been shown to work only in controlled laboratory conditions or consider simplified body-ground contact limited to feet. This paper explores how these shortcomings can be addressed by directly incorporating a fully-featured physics engine into the pose estimation process. Given an uncontrolled, real-world scene as input, our approach estimates the ground-plane location and the dimensions of the physical body model. It then recovers the physical motion by performing trajectory optimization. The advantage of our formulation is that it readily generalizes to a variety of scenes that might have diverse ground properties and supports any form of self-contact and contact between the articulated body and scene geometry. We show that our approach achieves competitive results with respect to existing physics-based methods on the Human3.6M benchmark, while being directly applicable without re-training to more complex dynamic motions from the AIST benchmark and to uncontrolled internet videos.
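To make the pipeline concrete, below is a minimal, hedged sketch of sampling-based trajectory optimization against a physics rollout, in the spirit of the approach described above. Everything here is an illustrative assumption rather than the paper's actual interface: a toy PD-controlled point mass with a crude ground plane stands in for a fully-featured articulated-body engine with contacts, the cross-entropy method stands in for whatever optimizer the method uses, and names like `rollout`, `cem_optimize`, and `target_traj` are hypothetical.

```python
# Hedged sketch: trajectory optimization over physics rollouts.
# Toy point-mass dynamics stand in for a full articulated-body engine;
# all shapes, gains, and weights are illustrative assumptions.
import numpy as np

DT, HORIZON, DIM = 1.0 / 30.0, 60, 3      # 2 s at 30 fps, 3-D point mass

def rollout(controls, q0, v0):
    """Integrate toy PD-controlled point-mass dynamics (stand-in for a
    physics engine stepping an articulated body with scene contacts)."""
    kp, kd, g = 300.0, 20.0, np.array([0.0, 0.0, -9.81])
    q, v, traj = q0.copy(), v0.copy(), []
    for target in controls:               # one PD target per frame
        acc = kp * (target - q) - kd * v + g
        v = v + DT * acc
        q = q + DT * v
        q[2] = max(q[2], 0.0)             # crude ground contact at z = 0
        traj.append(q.copy())
    return np.stack(traj)

def cost(controls, q0, v0, target_traj):
    """Data term (match kinematic estimates) plus a smoothness prior."""
    traj = rollout(controls, q0, v0)
    data = np.sum((traj - target_traj) ** 2)
    smooth = np.sum(np.diff(controls, axis=0) ** 2)
    return data + 0.1 * smooth

def cem_optimize(q0, v0, target_traj, iters=50, pop=64, elite=8):
    """Cross-entropy method over the control sequence: a simple
    gradient-free optimizer (the paper's optimizer may differ)."""
    mean = target_traj.copy()             # init controls at kinematic targets
    std = 0.05 * np.ones_like(mean)
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, *mean.shape)
        costs = np.array([cost(s, q0, v0, target_traj) for s in samples])
        best = samples[np.argsort(costs)[:elite]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-4
    return mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake "kinematic estimate": a noisy arc above the ground plane,
    # where the jitter plays the role of temporal motion artifacts.
    t = np.linspace(0, 2, HORIZON)
    target = np.stack([t, np.zeros_like(t), 1.0 + 0.2 * np.sin(3 * t)], 1)
    target += 0.02 * rng.standard_normal(target.shape)
    q0, v0 = target[0], np.zeros(DIM)
    ctrl = cem_optimize(q0, v0, target)
    print("final cost:", cost(ctrl, q0, v0, target))
```

The design choice this illustrates is the one the abstract emphasizes: because the simulator is treated as a black box inside the objective, nothing in the optimizer changes when the scene geometry, ground properties, or contact configuration change; only the rollout does.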