For visual manipulation tasks, we aim to represent image content with semantically meaningful features. However, learning implicit representations from images often lacks interpretability, especially when attributes are intertwined. We focus on the challenging task of extracting disentangled 3D attributes from 2D image data alone. Specifically, we target human appearance and learn implicit pose, shape, and garment representations of dressed humans from RGB images. Our method learns an embedding with disentangled latent representations of these three image properties and, through a 2D-to-3D encoder-decoder structure, enables meaningful re-assembly of features and control over each property. The 3D model is inferred solely from the feature map in the learned embedding space. To the best of our knowledge, our method is the first to achieve cross-domain disentanglement for this highly under-constrained problem. We qualitatively and quantitatively demonstrate our framework's ability to transfer pose, shape, and garments in 3D reconstruction on virtual data, and we show how an implicit shape loss can help the model recover fine-grained reconstruction details.
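To make the re-assembly idea concrete, below is a minimal PyTorch-style sketch of an encoder with three attribute heads and a decoder that maps re-assembled codes to a 3D output. All module names, dimensions, and the placeholder vertex-offset output are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the disentangled 2D-to-3D encoder-decoder idea.
# All shapes, names, and dimensions are illustrative assumptions,
# not the authors' actual architecture.
import torch
import torch.nn as nn


class DisentangledEncoder(nn.Module):
    """Encodes an RGB image into separate pose, shape, and garment codes."""
    def __init__(self, code_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Three heads, one per disentangled attribute.
        self.pose_head = nn.Linear(64, code_dim)
        self.shape_head = nn.Linear(64, code_dim)
        self.garment_head = nn.Linear(64, code_dim)

    def forward(self, img):
        feat = self.backbone(img)
        return self.pose_head(feat), self.shape_head(feat), self.garment_head(feat)


class Decoder3D(nn.Module):
    """Decodes re-assembled attribute codes into a 3D representation
    (here a fixed-size set of vertex positions, purely as a placeholder)."""
    def __init__(self, code_dim=128, n_verts=6890):
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(3 * code_dim, 512), nn.ReLU(),
            nn.Linear(512, n_verts * 3),
        )

    def forward(self, pose, shape, garment):
        codes = torch.cat([pose, shape, garment], dim=-1)
        return self.mlp(codes).view(-1, self.n_verts, 3)


# Attribute transfer by re-assembling codes from two source images.
enc, dec = DisentangledEncoder(), Decoder3D()
img_a = torch.rand(1, 3, 128, 128)
img_b = torch.rand(1, 3, 128, 128)
pose_a, shape_a, garment_a = enc(img_a)
pose_b, shape_b, garment_b = enc(img_b)
# e.g. reconstruct person A's body and clothing in person B's pose:
verts = dec(pose_b, shape_a, garment_a)
```

In such a setup, swapping any one of the three codes between images while keeping the others fixed corresponds to transferring that single attribute in the reconstructed 3D model.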