We propose embodied scene-aware human pose estimation, where we estimate 3D poses based on a simulated agent's proprioception and scene awareness, along with external third-person observations. Unlike prior methods that often resort to multi-stage optimization, non-causal inference, and complex contact modeling to estimate human pose and human-scene interactions, our method is one-stage, causal, and recovers global 3D human poses in a simulated environment. Since 2D third-person observations are coupled with the camera pose, we propose to disentangle the camera pose and use a multi-step projection gradient defined in the global coordinate frame as the movement cue for our embodied agent. Leveraging a physics simulation and pre-scanned scenes (e.g., 3D meshes), we simulate our agent in everyday environments (libraries, offices, bedrooms, etc.) and equip it with environmental sensors to intelligently navigate and interact with scene geometries. Our method relies only on 2D keypoints and can be trained on synthetic datasets derived from popular human motion databases. For evaluation, we use the popular H36M and PROX datasets and, for the first time, achieve a 96.7% success rate on the challenging PROX dataset without ever using PROX motion sequences for training.
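To make the movement cue concrete, below is a minimal sketch of one way a multi-step projection gradient in the global frame could be computed, assuming a pinhole camera with known intrinsics K and world-to-camera extrinsics (R, t); the function name, step count, and step size are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def projection_gradient(joints_3d_world, keypoints_2d, K, R, t,
                        n_steps=5, step_size=1e-3):
    """Sketch of a multi-step projection gradient as a movement cue.

    joints_3d_world: (J, 3) current 3D joints in the global frame
    keypoints_2d:    (J, 2) observed 2D keypoints in the image
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera extrinsics
    Returns a per-joint displacement in the global frame obtained by
    taking a few gradient steps on the 2D reprojection error.
    """
    x = joints_3d_world.clone().requires_grad_(True)
    for _ in range(n_steps):
        cam = x @ R.T + t                         # world -> camera frame
        uv_h = cam @ K.T                          # apply intrinsics
        uv = uv_h[:, :2] / uv_h[:, 2:3]           # perspective divide
        loss = ((uv - keypoints_2d) ** 2).sum()   # 2D reprojection error
        (grad,) = torch.autograd.grad(loss, x)
        x = (x - step_size * grad).detach().requires_grad_(True)
    # Displacement expressed in world coordinates: the movement cue.
    return (x - joints_3d_world).detach()
```

Because the gradient is taken with respect to joints expressed in the world frame, the resulting displacement is already decoupled from the camera pose, which is the property the abstract emphasizes.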