Full 3D estimation of human pose from a single image remains a challenging task despite many recent advances. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To tackle this question empirically, we have assembled a novel $\textbf{Geometric Pose Affordance}$ dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and constructed accurate geometric 3D CAD models of the scene itself. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a novel, view-based representation of scene geometry, a $\textbf{multi-layer depth map}$, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray. We propose two different mechanisms for integrating multi-layer depth information into pose estimation: first, as encoded ray features used in lifting 2D pose to full 3D, and second, as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We show experimentally that these techniques can improve the accuracy of 3D pose estimates, particularly in the presence of occlusion and complex scene geometry.
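The multi-layer depth representation described above can be sketched in a few lines of code. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): it uses analytic ray-sphere intersection as a stand-in for multi-hit ray tracing against a CAD scene, recording the sorted surface crossing depths (entry and exit points) along each camera ray into a fixed number of layers. The function names, pinhole camera model, and sphere-based scene are assumptions for illustration only.

```python
import numpy as np

def sphere_hits(origin, direction, center, radius):
    """Return sorted positive entry/exit depths of a ray against a sphere (multi-hit)."""
    oc = origin - center
    b = np.dot(direction, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return []  # ray misses the surface entirely
    s = np.sqrt(disc)
    return sorted(t for t in (-b - s, -b + s) if t > 0)

def multilayer_depth_map(width, height, spheres, max_layers=4):
    """Per-pixel sorted list of surface crossing depths along each camera view ray.

    Unused layers stay at +inf, marking free space beyond the last surface.
    """
    mld = np.full((height, width, max_layers), np.inf)
    origin = np.zeros(3)
    for v in range(height):
        for u in range(width):
            # Simple pinhole camera: rays through a normalized image plane at z = 1.
            d = np.array([(u + 0.5) / width - 0.5, (v + 0.5) / height - 0.5, 1.0])
            d /= np.linalg.norm(d)
            # Gather and sort crossings from every surface in the scene.
            hits = sorted(t for c, r in spheres for t in sphere_hits(origin, d, c, r))
            hits = hits[:max_layers]
            mld[v, u, :len(hits)] = hits
    return mld
```

A consistency check against such a map is then straightforward: a 3D joint is in free space if its depth along its view ray falls outside every (entry, exit) interval stored for that pixel, which is what a geometric-consistency loss can penalize.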