Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera mounted on the user's head. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, a requirement that often limits the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric videos and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages connected by head motion as an intermediate representation. EgoEgo first integrates SLAM with a learning approach to estimate accurate head motion. Subsequently, taking the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motions, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motions. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
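To make the two-stage decomposition concrete, below is a minimal, self-contained PyTorch sketch of the pipeline shape: stage 1 (head pose from egocentric video) is stubbed out, and stage 2 draws several body motions from a toy conditional diffusion model via standard DDPM ancestral sampling. All module names, tensor dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual architecture or training setup.

```python
# Minimal sketch of the EgoEgo two-stage decomposition (not the authors' code).
# Stage 1 (head pose from egocentric video) is stubbed; stage 2 samples
# full-body motion from a toy conditional diffusion model with DDPM
# ancestral sampling. All names, shapes, and hyperparameters are assumptions.

import torch
import torch.nn as nn

HEAD_DIM, BODY_DIM, SEQ_LEN = 9, 135, 120  # assumed pose/motion sizes
STEPS = 50                                 # diffusion steps, shortened for the demo

def estimate_head_pose(video_frames):
    """Stage 1 stub: the paper combines monocular SLAM with learned
    components to recover an accurate head trajectory from video."""
    return torch.randn(1, SEQ_LEN, HEAD_DIM)  # placeholder head-pose sequence

class Denoiser(nn.Module):
    """Toy noise-prediction network, conditioned on the head-pose
    sequence and the (normalized) diffusion timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BODY_DIM + HEAD_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, BODY_DIM),
        )

    def forward(self, x_t, head, t):
        t_feat = (t.float() / STEPS).view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, head, t_feat], dim=-1))

@torch.no_grad()
def sample_body_motion(denoiser, head, n_samples=3):
    """Stage 2: start from Gaussian noise and iteratively denoise;
    different noise seeds yield multiple plausible body motions
    consistent with the same head trajectory."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    head = head.expand(n_samples, -1, -1)
    x = torch.randn(n_samples, SEQ_LEN, BODY_DIM)
    for t in reversed(range(STEPS)):
        eps = denoiser(x, head, torch.full((n_samples,), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

head = estimate_head_pose(video_frames=None)   # stage 1
bodies = sample_body_motion(Denoiser(), head)  # stage 2
print(bodies.shape)  # (3, 120, 135): three plausible motions, one head path
```

The property the sketch illustrates is that the two stages communicate only through the head-pose tensor, which is why each stage can be trained on a separate dataset, with no paired egocentric video and full-body motion required.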