Recovering world-coordinate human motion from monocular video and retargeting it to humanoid robots is important for embodied intelligence and robotics. To avoid complex SLAM pipelines or heavy temporal models, we propose a lightweight, engineering-oriented framework that leverages SAM 3D Body (3DB) as a frozen perception backbone and uses the Momentum HumanRig (MHR) representation as a robot-friendly intermediate representation. Our method (i) locks the identity and skeleton-scale parameters of each tracked subject to enforce temporally consistent bone lengths, (ii) smooths per-frame predictions via efficient sliding-window optimization in the low-dimensional MHR latent space, and (iii) recovers physically plausible global root trajectories with a differentiable soft foot-ground contact model and contact-aware global optimization. Finally, we retarget the reconstructed motion to the Unitree G1 humanoid using a kinematics-aware two-stage inverse kinematics pipeline. Results on real monocular videos show that our method produces stable world trajectories and reliable robot retargeting, indicating that structured human representations combined with lightweight physical constraints can yield robot-ready motion from monocular input.
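To make the contact-aware trajectory recovery in step (iii) concrete, the following is a minimal sketch of a differentiable soft foot-ground contact term and a root-trajectory refinement loop. All names (soft_contact_loss, refine_root_trajectory, the sigmoid contact indicator, and the loss weights) are illustrative assumptions for exposition, not the framework's actual implementation, which optimizes in the MHR latent space with additional terms.

```python
# Illustrative sketch only: a differentiable soft foot-ground contact penalty
# and a contact-aware refinement of the global root translation.
import torch


def soft_contact_loss(foot_pos, ground_height=0.0, height_scale=0.02):
    """Penalize foot sliding and ground penetration with a soft contact indicator.

    foot_pos: (T, F, 3) world-frame positions of F foot keypoints over T frames.
    """
    height = foot_pos[..., 2] - ground_height              # (T, F) height above ground
    contact_prob = torch.sigmoid(-height / height_scale)   # soft contact indicator in [0, 1]
    vel = foot_pos[1:] - foot_pos[:-1]                      # finite-difference velocity, (T-1, F, 3)
    slide = vel[..., :2].norm(dim=-1)                       # horizontal sliding magnitude
    penetration = torch.relu(-height)                       # feet should not sink below the ground
    return (contact_prob[1:] * slide).mean() + penetration.mean()


def refine_root_trajectory(root_xyz, local_foot_pos, n_iters=200, lr=1e-2):
    """Contact-aware correction of the per-frame root translation (hypothetical helper).

    root_xyz:       (T, 3) initial root positions in world coordinates.
    local_foot_pos: (T, F, 3) foot keypoints expressed relative to the root.
    """
    delta = torch.zeros_like(root_xyz, requires_grad=True)  # learnable correction to the root path
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        root = root_xyz + delta
        foot_world = local_foot_pos + root[:, None, :]       # feet lifted into world coordinates
        loss = soft_contact_loss(foot_world)
        loss = loss + 1e-2 * (delta[1:] - delta[:-1]).pow(2).mean()  # keep the correction smooth
        loss.backward()
        opt.step()
    return (root_xyz + delta).detach()
```

Because the contact indicator is a smooth sigmoid rather than a hard threshold, gradients flow through both the sliding and penetration terms, which is what makes a simple first-order optimizer sufficient for this kind of global refinement.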