Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage learned priors from large-scale specific 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct body geometry and infer full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces a 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements a visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. To further improve visual detail, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
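To make the role of the CLIP-based semantic prior concrete, the following is a minimal sketch (not the authors' released code) of how a frozen CLIP image encoder can supply a semantic consistency loss during radiance-field optimization: patches rendered from unseen viewpoints are pulled toward the CLIP embedding of the single reference image. The function name `clip_semantic_loss` and the patch/reference interface are illustrative assumptions.

```python
# Sketch of a CLIP-based semantic prior loss for NeRF optimization.
# Assumes OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # frozen, pre-trained CLIP
model.eval()

# CLIP's ImageNet-style normalization constants.
_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def _encode(img: torch.Tensor) -> torch.Tensor:
    """Encode a (B, 3, H, W) image batch in [0, 1] with CLIP's image tower."""
    img = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
    img = (img - _MEAN) / _STD
    return F.normalize(model.encode_image(img), dim=-1)

def clip_semantic_loss(rendered_patch: torch.Tensor,
                       reference_image: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between CLIP embeddings of a patch
    rendered from an unseen viewpoint and the single input image.
    Gradients flow only through the rendered patch, guiding the
    radiance field toward semantically plausible unseen content."""
    with torch.no_grad():
        ref_feat = _encode(reference_image)  # fixed target embedding
    patch_feat = _encode(rendered_patch)
    return 1.0 - (patch_feat * ref_feat).sum(dim=-1).mean()
```

In this sketch the loss is semantic rather than pixel-wise, which is what allows it to supervise invisible regions; the segmentation-based sampling described above would correspond to choosing the rendered patches around individual body parts (e.g., head, torso, limbs) rather than the whole figure.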