Recent advances in 3D human shape reconstruction from single images have shown impressive results, leveraging deep networks that model the so-called implicit function to learn the occupancy status of arbitrarily dense 3D points in space. However, while current algorithms based on this paradigm, such as PiFuHD, are able to estimate accurate geometry of the human shape and clothes, they require high-resolution input images and are not able to capture complex body poses. Most training and evaluation is performed on 1k-resolution images of humans standing in front of the camera in neutral body poses. In this paper, we leverage publicly available data to extend existing implicit function-based models to deal with images of humans that can have arbitrary poses and self-occluded limbs. We argue that the representation power of the implicit function is not sufficient to simultaneously model details of the geometry and of the body pose. We therefore propose a coarse-to-fine approach in which we first learn an implicit function that maps the input image to a 3D body shape with a low level of detail, but which correctly fits the underlying human pose despite its complexity. We then learn a displacement map, conditioned on the smoothed surface and on the input image, which encodes the high-frequency details of the clothes and body. In the experimental section, we show that this coarse-to-fine strategy provides a very good trade-off between shape detail and pose correctness, comparing favorably to the most recent state-of-the-art approaches. Our code will be made publicly available.
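The coarse-to-fine pipeline described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: all networks are stand-ins (random linear maps), and the function names (`image_features`, `coarse_occupancy`, `refine_surface`) are hypothetical. It only shows the data flow, in which a coarse implicit function is queried at dense 3D points with pixel-aligned features, and a displacement map then offsets the extracted coarse surface along its normals to add high-frequency detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_features(points_2d, feat_dim=8):
    """Stand-in for pixel-aligned image features sampled at 2D projections."""
    W = rng.standard_normal((2, feat_dim))
    return points_2d @ W

def coarse_occupancy(points_3d, feats):
    """Stand-in coarse implicit function: (feature, depth) -> occupancy."""
    z = points_3d[:, 2:3]                    # depth of each query point
    h = np.concatenate([feats, z], axis=1)   # pixel-aligned feature + depth
    W = rng.standard_normal((h.shape[1], 1))
    logits = h @ W
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> occupancy in (0, 1)

def refine_surface(surface_pts, normals, feats):
    """Stand-in displacement map: scalar offset along the coarse normals."""
    W = rng.standard_normal((feats.shape[1], 1))
    d = np.tanh(feats @ W) * 0.05            # small, bounded displacement
    return surface_pts + d * normals

# Query a batch of 3D points against the coarse implicit function.
pts = rng.standard_normal((16, 3))
feats = image_features(pts[:, :2])
occ = coarse_occupancy(pts, feats)

# Pretend the coarse surface was extracted (e.g. via marching cubes),
# then refine it with the displacement map.
normals = np.tile(np.array([[0.0, 0.0, 1.0]]), (16, 1))
refined = refine_surface(pts, normals, feats)
```

In practice the coarse stage would be trained to fit the body pose robustly, and only the second stage would be responsible for clothing and surface detail.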