A key challenge of learning the geometry of dressed humans lies in the limited availability of ground truth data (e.g., 3D scanned models), which degrades the performance of 3D human reconstruction when applied to real-world imagery. We address this challenge by leveraging a new data resource: social media dance videos spanning diverse appearances, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person but lacks 3D ground truth geometry. To utilize these videos, we present a new method that uses local transformations to warp the predicted local geometry of the person from one image to that of another image at a different time instant. This enables self-supervision by enforcing temporal coherence over the predictions. In addition, we jointly learn the depth and the surface normals, which are highly responsive to local texture, wrinkles, and shading, by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high-fidelity depth estimation that predicts fine geometry faithful to the input real image. We demonstrate that our method outperforms state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images.
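To make the depth-normal geometric consistency idea concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: it derives normals from the predicted depth under a simple orthographic-camera assumption and penalizes their angular disagreement with a separately predicted normal map inside a person mask. The function names `normals_from_depth` and `depth_normal_consistency_loss`, and the `mask` argument, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map of shape (B, 1, H, W).

    Assumes an orthographic camera, so the surface tangents are the finite
    differences of (x, y, depth) along the image axes.
    """
    dz_dx = depth[..., :, 1:] - depth[..., :, :-1]   # (B, 1, H, W-1)
    dz_dy = depth[..., 1:, :] - depth[..., :-1, :]   # (B, 1, H-1, W)
    # Pad back to the original spatial size.
    dz_dx = F.pad(dz_dx, (0, 1), mode="replicate")
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1), mode="replicate")
    # Normal direction is proportional to (-dz/dx, -dz/dy, 1).
    normal = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    return F.normalize(normal, dim=1)


def depth_normal_consistency_loss(pred_depth, pred_normal, mask):
    """Cosine-distance loss between depth-implied normals and predicted normals,
    averaged over the person mask (B, 1, H, W)."""
    n_from_depth = normals_from_depth(pred_depth)
    cos = (n_from_depth * F.normalize(pred_normal, dim=1)).sum(dim=1, keepdim=True)
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp(min=1.0)
```

In practice such a term would be combined with the warping-based temporal-coherence supervision described above; the sketch only illustrates the consistency component.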