Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.
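The fusion step described above can be illustrated with a minimal, stdlib-only sketch. This is not the authors' implementation: the feature dimensions, values, and the single-head dot-product attention below are invented stand-ins for SHERF's actual feature fusion transformer, shown only to make the idea of attention-weighted integration of global, point-level, and pixel-aligned features concrete.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_features(query, feature_bank):
    """Toy single-head attention: a per-point query feature attends over a
    bank of [global, point-level, pixel-aligned] features and returns their
    attention-weighted sum. A stand-in for a learned fusion transformer."""
    d = len(query)
    scores = [dot(query, f) / math.sqrt(d) for f in feature_bank]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, feature_bank))
             for i in range(d)]
    return fused, weights

# Hypothetical 4-D features for one query point in canonical space:
global_feat = [0.2, 0.1, 0.0, 0.3]  # coarse appearance from the whole image
point_feat  = [0.5, 0.4, 0.1, 0.0]  # structural cue tied to the body model
pixel_feat  = [0.9, 0.2, 0.3, 0.1]  # fine texture projected from the input view
query       = [0.4, 0.3, 0.2, 0.1]

fused, weights = fuse_features(query, [global_feat, point_feat, pixel_feat])
```

In the real model the three feature types come from learned encoders and the fusion is a multi-layer transformer; the sketch only captures the high-level design choice that each query point adaptively weights complementary sources of information.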