Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera mounted on the user's head. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, a requirement that often limits the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric videos and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages connected by head motion as an intermediate representation. EgoEgo first integrates SLAM with a learning approach to estimate accurate head motion. Subsequently, taking the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motions, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motions. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
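To make the two-stage decomposition concrete, below is a minimal, self-contained PyTorch sketch of the pipeline shape: stage 1 (head pose from egocentric video) is stubbed out, and stage 2 draws several body motions from a toy conditional diffusion model via standard DDPM ancestral sampling. All module names, tensor dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual architecture or training setup.

```python
# Minimal sketch of the EgoEgo two-stage decomposition (not the authors' code).
# Stage 1 (head pose from egocentric video) is stubbed; stage 2 samples
# full-body motion from a toy conditional diffusion model with DDPM
# ancestral sampling. All names, shapes, and hyperparameters are assumptions.

import torch
import torch.nn as nn

HEAD_DIM, BODY_DIM, SEQ_LEN = 9, 135, 120  # assumed pose/motion sizes
STEPS = 50                                 # diffusion steps, shortened for the demo

def estimate_head_pose(video_frames):
    """Stage 1 stub: the paper combines monocular SLAM with learned
    components to recover an accurate head trajectory from video."""
    return torch.randn(1, SEQ_LEN, HEAD_DIM)  # placeholder head-pose sequence

class Denoiser(nn.Module):
    """Toy noise-prediction network, conditioned on the head-pose
    sequence and the (normalized) diffusion timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BODY_DIM + HEAD_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, BODY_DIM),
        )

    def forward(self, x_t, head, t):
        t_feat = (t.float() / STEPS).view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, head, t_feat], dim=-1))

@torch.no_grad()
def sample_body_motion(denoiser, head, n_samples=3):
    """Stage 2: start from Gaussian noise and iteratively denoise;
    different noise seeds yield multiple plausible body motions
    consistent with the same head trajectory."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    head = head.expand(n_samples, -1, -1)
    x = torch.randn(n_samples, SEQ_LEN, BODY_DIM)
    for t in reversed(range(STEPS)):
        eps = denoiser(x, head, torch.full((n_samples,), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x

head = estimate_head_pose(video_frames=None)   # stage 1
bodies = sample_body_motion(Denoiser(), head)  # stage 2
print(bodies.shape)  # (3, 120, 135): three plausible motions, one head path
```

The property the sketch illustrates is that the two stages communicate only through the head-pose tensor, which is why each stage can be trained on a separate dataset, with no paired egocentric video and full-body motion required.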