In this work, we consider the problem of estimating the 3D position of multiple humans in a scene, as well as their body shape and articulation, from a single RGB video recorded with a static camera. In contrast to expensive marker-based or multi-view systems, our lightweight setup is ideal for private users, as it enables affordable 3D motion capture that is easy to install and requires no expert knowledge. To deal with this challenging setting, we leverage recent advances in computer vision, using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. Building on these cues, we introduce the first non-linear optimization-based approach that jointly solves for the absolute 3D position of each human, their articulated pose, their individual shape, and the scale of the scene. In particular, we estimate the scene depth and the unique scale of each person from normalized disparity predictions using the 2D body joints and joint angles. Given the per-frame scene depth, we reconstruct a point cloud of the static scene in 3D space. Finally, given the per-frame 3D estimates of the humans and the scene point cloud, we perform a space-time coherent optimization over the video to ensure temporal, spatial, and physical plausibility. We evaluate our method on established multi-person 3D human pose benchmarks, where we consistently outperform previous methods, and we qualitatively demonstrate that our method is robust to in-the-wild conditions, including challenging scenes with people of different sizes.
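To make the depth-recovery step concrete, the sketch below (Python/NumPy) illustrates one way the scale and shift of a normalized disparity map could be fit so that the resulting metric depths agree with coarse per-person depths derived from 2D joints and a known limb length. The helper names (`person_depth_from_joints`, `fit_disparity_scale_shift`), the pinhole-camera assumption, and the affine disparity-to-inverse-depth model are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def person_depth_from_joints(joints_2d, limb_len_3d, focal):
    """Coarse depth of a person from the 2D length of one limb.

    Under a pinhole model, a limb of metric length L that projects
    to l pixels lies at depth z ~= focal * L / l.
    joints_2d: (2, 2) array holding the two joint endpoints in pixels.
    """
    l_pix = np.linalg.norm(joints_2d[0] - joints_2d[1])
    return focal * limb_len_3d / max(l_pix, 1e-6)

def fit_disparity_scale_shift(disp_at_persons, target_depths):
    """Fit (a, b) so that depth 1 / (a * d + b) matches target_depths.

    Equivalently, a * d + b should match 1 / z, which is linear in
    (a, b) and solvable in closed form by least squares.
    """
    d = np.asarray(disp_at_persons, dtype=float)
    inv_z = 1.0 / np.asarray(target_depths, dtype=float)
    A = np.stack([d, np.ones_like(d)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, inv_z, rcond=None)
    return a, b

# Toy usage: two people whose torsos (assumed ~0.5 m long) are visible.
focal = 1000.0  # pixels; assumed camera intrinsic
torso_px = [np.array([[320., 200.], [320., 300.]]),   # person 1
            np.array([[640., 220.], [640., 270.]])]   # person 2
z_persons = [person_depth_from_joints(j, 0.5, focal) for j in torso_px]
disp_persons = [0.8, 0.4]  # normalized disparity sampled at each person
a, b = fit_disparity_scale_shift(disp_persons, z_persons)
metric_depth = lambda d: 1.0 / (a * d + b)  # lift any disparity value
```

With (a, b) in hand, every pixel of the normalized disparity map can be lifted to metric depth, which is what would allow the static scene to be reconstructed as a point cloud in the same metric space as the humans.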