We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction either relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry whose artifacts cause motion-tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, simulation-ready geometry by fitting planar primitives to a point-cloud reconstruction of the scene via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that may be occluded during interactions, we exploit human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that the human and scene reconstructions are physically plausible by using them to drive a humanoid controller trained via reinforcement learning. Our approach reduces motion-tracking failure rates from 55.2\% to 6.9\% on human-centric video benchmarks (EMDB, PROX), while delivering 43\% faster RL simulation throughput. We further validate it on in-the-wild footage, including casually captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
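To make the planar-primitive idea concrete, below is a minimal sketch (not the authors' implementation) of one plausible step of such a pipeline: clustering scene points by their estimated surface normals and fitting a least-squares plane to each cluster. The inputs `points` and `normals`, the DBSCAN parameters, and the omission of depth/flow cues are all simplifying assumptions for illustration.

```python
# Hypothetical sketch: group scene points into planar primitives by clustering
# their normals, then fit a plane to each cluster via SVD. Inputs are assumed
# to come from off-the-shelf monocular depth/normal estimation.
import numpy as np
from sklearn.cluster import DBSCAN

def fit_planar_primitives(points, normals, eps=0.1, min_samples=50):
    """Return a list of (normal, offset, point_indices) planes.

    points  : (N, 3) array of back-projected scene points
    normals : (N, 3) array of unit surface normals per point
    """
    # Cluster by surface normal so points on the same flat surface group together.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(normals)
    planes = []
    for label in set(labels) - {-1}:          # -1 marks DBSCAN noise points
        idx = np.where(labels == label)[0]
        pts = points[idx]
        centroid = pts.mean(axis=0)
        # Least-squares plane normal = singular vector with smallest singular value.
        _, _, vt = np.linalg.svd(pts - centroid)
        n = vt[-1]
        d = -n @ centroid                     # plane equation: n . x + d = 0
        planes.append((n, d, idx))
    return planes
```

In practice, each fitted plane (or the convex hull of its supporting points) could then be exported as a clean collision primitive for simulation, which is the property the abstract emphasizes.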