Scene-Aware 3D Multi-Human Motion Capture from a Single Camera

From arXiv; accepted to Eurographics 2023. See also the GitHub repository: https://github.com/dluvizon/scene-aware-3d-multi-human and the project page: https://vcai.mpi-inf.mpg.de/projects/scene-aware-3d-multi-human/

In this work, we consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera. In contrast to expensive marker-based or multi-view systems, our lightweight setup is ideal for private users as it enables an affordable 3D motion capture that is easy to install and does not require expert knowledge. To deal with this challenging setting, we leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. Thus, we introduce the first non-linear optimization-based approach that jointly solves for the absolute 3D position of each human, their articulated pose, their individual shapes as well as the scale of the scene. In particular, we estimate the scene depth and person unique scale from normalized disparity predictions using the 2D body joints and joint angles. Given the per-frame scene depth, we reconstruct a point-cloud of the static scene in 3D space. Finally, given the per-frame 3D estimates of the humans and scene point-cloud, we perform a space-time coherent optimization over the video to ensure temporal, spatial and physical plausibility. We evaluate our method on established multi-person 3D human pose benchmarks where we consistently outperform previous methods and we qualitatively demonstrate that our method is robust to in-the-wild conditions including challenging scenes with people of different sizes.

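Translation (from the Chinese): This work addresses the problem of estimating, from a single RGB video recorded with a static camera, the 3D position of multiple people in a scene together with their body shape and articulation, without relying on expensive marker-based or multi-view systems. The lightweight setup suits private users because it offers affordable 3D motion capture that is easy to install and requires no expert knowledge. To handle this challenging setting, the authors build on recent computer vision results, using large-scale pre-trained models for several modalities: 2D body joints, joint angles, normalized disparity maps, and human segmentation masks. On this basis they propose the first non-linear optimization-based approach that jointly solves for each person's absolute 3D position, articulated pose, and individual shape, as well as the scale of the scene. In particular, the scene depth and each person's unique scale are estimated from the normalized disparity predictions using the 2D body joints and joint angles. Given the per-frame scene depth, a point cloud of the static scene is reconstructed in 3D space. Finally, given the per-frame 3D human estimates and the scene point cloud, a space-time coherent optimization over the video enforces temporal, spatial, and physical plausibility. The method is evaluated on established multi-person 3D human pose benchmarks, where it consistently outperforms previous methods, and qualitative results show that it is robust to in-the-wild conditions, including challenging scenes with people of different sizes.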
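To make the depth-estimation step in the abstract more concrete, here is a minimal sketch of one common way to turn a normalized disparity map into metric depth. It is not the authors' implementation. It assumes that inverse depth is an affine function of the predicted disparity (1/z ≈ a·d + b), and that approximate metric depths are already known at a few body-joint pixels, for example from fitting a body model to the 2D joints. The function names `fit_disparity_scale_shift` and `disparity_to_depth` are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def fit_disparity_scale_shift(
    disparity: Sequence[float], depth_m: Sequence[float]
) -> Tuple[float, float]:
    """Fit 1/depth ~= a * disparity + b by closed-form least squares.

    disparity: normalized disparity sampled at joint pixels (assumed input).
    depth_m:   approximate metric depth of the same joints (assumed input).
    """
    if len(disparity) != len(depth_m) or len(disparity) < 2:
        raise ValueError("need at least two matching samples")
    if any(z <= 0 for z in depth_m):
        raise ValueError("depths must be positive")

    inv_depth = [1.0 / z for z in depth_m]
    n = len(disparity)
    mean_d = sum(disparity) / n
    mean_i = sum(inv_depth) / n

    # Ordinary least squares for a line: slope = cov(d, 1/z) / var(d).
    var_d = sum((d - mean_d) ** 2 for d in disparity)
    if var_d == 0:
        raise ValueError("disparity samples must not all be equal")
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(disparity, inv_depth))

    a = cov / var_d
    b = mean_i - a * mean_d
    return a, b


def disparity_to_depth(d: float, a: float, b: float) -> float:
    """Convert one normalized disparity value to metric depth."""
    inv = a * d + b
    if inv <= 0:
        raise ValueError("non-positive inverse depth; fit is invalid here")
    return 1.0 / inv


if __name__ == "__main__":
    # Synthetic joint samples, made up for this example.
    joint_disparity = [0.8, 0.6, 0.3]
    joint_depth_m = [2.0, 2.5, 4.0]
    a, b = fit_disparity_scale_shift(joint_disparity, joint_depth_m)
    print("scale:", a, "shift:", b)
    print("depth at disparity 0.5:", disparity_to_depth(0.5, a, b))
```

Once `a` and `b` are fitted, applying `disparity_to_depth` to every pixel of the disparity map gives a per-frame depth map that could be back-projected into a scene point cloud. The paper instead solves for these quantities jointly with pose and shape in a non-linear optimization, so this closed-form fit only illustrates the underlying idea.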