Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle with challenging poses in which the human body is heavily occluded or closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network that predicts the scene depth map from a wide-view egocentric fisheye camera, using a depth-inpainting network to mitigate occlusion by the human body. Next, we propose a scene-aware pose estimation network that projects the 2D image features and the estimated scene depth map into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides a direct geometric connection between 2D image features and scene geometry, and further enables the V2V network to constrain the predicted pose based on the estimated scene geometry. To train the aforementioned networks, we also generate a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. Experimental results on our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms the state-of-the-art methods both quantitatively and qualitatively.
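To make the voxel-based fusion step concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: 2D image features and an estimated scene depth map are unprojected into a shared voxel volume, which a V2V-style 3D CNN then regresses into per-joint 3D heatmaps. All tensor shapes, the 10 cm occupancy band, the `pinhole` projection stand-in (the paper uses a fisheye camera model), and the toy network are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of voxel-based feature fusion for scene-aware pose
# estimation. Shapes, the projection model, and the occupancy threshold are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class V2VSketch(nn.Module):
    """Toy V2V-style 3D CNN over a voxel grid (not the paper's exact network)."""
    def __init__(self, in_ch: int, num_joints: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, num_joints, 1),  # per-joint 3D heatmaps
        )

    def forward(self, vox):
        return self.net(vox)

def unproject_to_voxels(feat2d, depth, grid_xyz, project_fn):
    """Fill a voxel grid by sampling 2D features at each voxel's projection.

    feat2d:     (B, C, H, W) image features
    depth:      (B, 1, H, W) estimated scene depth (metres, assumed)
    grid_xyz:   (D, D, D, 3) voxel centre coordinates in camera space
    project_fn: maps 3D camera-space points to normalised [-1, 1] pixel
                coordinates; stands in for the fisheye camera model here
    """
    B, C, H, W = feat2d.shape
    D = grid_xyz.shape[0]
    uv = project_fn(grid_xyz.reshape(-1, 3))             # (D^3, 2) in [-1, 1]
    uv = uv.view(1, D, D * D, 2).expand(B, -1, -1, -1)   # grid_sample layout
    sampled = F.grid_sample(feat2d, uv, align_corners=False)  # (B, C, D, D^2)
    sampled = sampled.view(B, C, D, D, D)
    # Occupancy channel: mark voxels lying near the estimated scene surface,
    # giving the 3D CNN direct access to the scene geometry.
    z_img = F.grid_sample(depth, uv, align_corners=False).view(B, 1, D, D, D)
    z_vox = grid_xyz[..., 2].view(1, 1, D, D, D).expand(B, -1, -1, -1, -1)
    occ = (torch.abs(z_vox - z_img) < 0.1).float()       # 10 cm band, assumed
    return torch.cat([sampled, occ], dim=1)              # (B, C+1, D, D, D)

if __name__ == "__main__":
    B, C, H, W, D, J = 1, 16, 64, 64, 16, 15
    feat2d = torch.randn(B, C, H, W)
    depth = torch.rand(B, 1, H, W) * 3.0
    axis = torch.linspace(-1.0, 1.0, D)
    # Voxel grid placed roughly in front of the camera (z in [1, 3] metres).
    grid_xyz = torch.stack(
        torch.meshgrid(axis, axis, axis + 2.0, indexing="ij"), dim=-1)

    def pinhole(p):  # placeholder projection; the paper uses a fisheye model
        return p[:, :2] / p[:, 2:3].clamp(min=1e-3)

    vox = unproject_to_voxels(feat2d, depth, grid_xyz, pinhole)
    heatmaps = V2VSketch(C + 1, J)(vox)
    print(heatmaps.shape)  # torch.Size([1, 15, 16, 16, 16])
```

In this sketch the scene constraint enters purely through the extra occupancy channel; the 3D convolutions can then learn to keep predicted joint heatmaps consistent with the reconstructed scene surface.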