Ego-Body Pose Estimation via Ego-Head Pose Estimation

Jiaman Li, C. Karen Liu, Jiajun Wu

From arXiv. Project website: https://lijiaman.github.io/projects/egoego/

Estimating 3D human motion from an egocentric video sequence is critical to human behavior understanding and applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), that decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Then, taking the estimated head pose as input, it leverages conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the state-of-the-art.
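
The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. It shows only the data flow: frames go to a head-pose trajectory, and that trajectory alone conditions several sampled full-body motions. Everything named here is a hypothetical stand-in and not the authors' code: `toy_head_estimator` takes the place of the SLAM-plus-learning stage, `toy_body_sampler` takes the place of the conditional diffusion model, and the pose layouts and the 22-joint count are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Sequence

HeadPose = List[float]  # assumed layout: [x, y, z, qw, qx, qy, qz]
BodyPose = List[float]  # assumed layout: flattened per-joint 3D values


def estimate_head_poses(
    frames: Sequence[object],
    head_estimator: Callable[[object], HeadPose],
) -> List[HeadPose]:
    """Stage 1: map each egocentric frame to one head pose."""
    return [head_estimator(frame) for frame in frames]


def sample_body_motions(
    head_poses: Sequence[HeadPose],
    body_sampler: Callable[[Sequence[HeadPose], random.Random], List[BodyPose]],
    num_samples: int,
    seed: int = 0,
) -> List[List[BodyPose]]:
    """Stage 2: draw several body motions for the same head trajectory.

    The sampler never sees the video frames, only the head poses, which is
    why the two stages could be trained on separate datasets.
    """
    rng = random.Random(seed)
    return [body_sampler(head_poses, rng) for _ in range(num_samples)]


def toy_head_estimator(frame: object) -> HeadPose:
    # Placeholder (not SLAM): a fixed head position at 1.6 m with identity rotation.
    return [0.0, 0.0, 1.6, 1.0, 0.0, 0.0, 0.0]


def toy_body_sampler(
    head_poses: Sequence[HeadPose],
    rng: random.Random,
    num_joints: int = 22,  # assumed joint count, for illustration only
) -> List[BodyPose]:
    # Placeholder (not diffusion): scatter joints randomly at or below the head.
    motion = []
    for pose in head_poses:
        x, y, z = pose[:3]
        joints: BodyPose = []
        for _ in range(num_joints):
            joints.extend([
                x + rng.gauss(0.0, 0.1),
                y + rng.gauss(0.0, 0.1),
                z - abs(rng.gauss(0.0, 0.5)),
            ])
        motion.append(joints)
    return motion


frames = [f"frame_{i}" for i in range(4)]
head_poses = estimate_head_poses(frames, toy_head_estimator)
motions = sample_body_motions(head_poses, toy_body_sampler, num_samples=3)
print(len(motions), len(motions[0]), len(motions[0][0]))  # 3 4 66
```

Passing the estimator and sampler in as callables keeps the two stages independent in the sketch, mirroring the claim that paired video and motion data are not required.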
