The 3D world constrains the human body pose, and the human body pose in turn conveys information about the surrounding objects. Indeed, given a single image of a person in an indoor scene, we as humans are adept at resolving ambiguities in the human pose and room layout through our knowledge of physical laws and our priors over plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimation, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level and performs joint optimization over the scene and human poses.