Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale, domain-specific 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic, animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the observation that humans can effortlessly estimate body geometry and imagine full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT obtains the 3D body shape prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with pre-trained CLIP models. Both priors jointly guide the optimization to create plausible content in invisible areas. Taking advantage of the CLIP models, ELICIT can use text descriptions to generate text-conditioned content for unseen regions. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms strong baseline methods of avatar creation when only a single image is available. The code is publicly available for research purposes at https://elicit3d.github.io/
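To make the role of the visual semantic prior concrete, the following is a minimal, hypothetical sketch of how a CLIP-based similarity loss could be used to guide the optimization of a radiance field toward plausible content in unseen regions. It assumes the OpenAI CLIP package; the function name `clip_semantic_loss` and the rendered-patch workflow are illustrative placeholders, not the authors' exact implementation.

```python
# Hypothetical sketch: CLIP-based semantic loss guiding NeRF optimization.
# Assumes the OpenAI CLIP package (https://github.com/openai/CLIP) and PyTorch.
import torch
import torchvision.transforms as T
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()  # keep fp32 so gradients flow cleanly to the renderer

# Standard CLIP input normalization (applied to rendered patches before encoding).
clip_normalize = T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                             std=(0.26862954, 0.26130258, 0.27577711))

def clip_semantic_loss(rendered_patch, reference_embedding):
    """Negative cosine similarity between a rendered patch and a reference
    CLIP embedding (from the input image crop or a text prompt).

    rendered_patch: (1, 3, 224, 224) tensor in [0, 1], differentiable
    w.r.t. the radiance-field parameters.
    """
    image_features = model.encode_image(clip_normalize(rendered_patch))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return 1.0 - (image_features * reference_embedding).sum(dim=-1).mean()

# Text-conditioned target for an unseen region (e.g. the avatar's back):
with torch.no_grad():
    text_features = model.encode_text(
        clip.tokenize(["a person wearing a blue T-shirt"]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# During training, `rendered_patch` would come from volume-rendering a novel
# view of the avatar; the loss gradient flows back into the radiance field:
# loss = clip_semantic_loss(rendered_patch, text_features)
```

In this sketch the same loss can be driven either by CLIP features of the visible input image (encouraging semantic consistency across views) or by a text embedding, which is what enables text-conditioned completion of unseen regions.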