Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3D morphable models (3DMMs), enabling high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters from a single input image. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and more accurate facial expressions. For broad generalization, we employ domain-specific ViT backbones pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground-truth geometry. Pix2NPHM not only enables 3D reconstruction at interactive frame rates; geometric fidelity can further be improved by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that scales to in-the-wild data.
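To make the regression setup concrete, the following is a minimal sketch, not the authors' implementation: a pretrained ViT backbone with linear heads that map a single image to NPHM identity and expression latent codes. The backbone name, latent dimensions, and head layout are illustrative assumptions; the paper's backbones are ViTs pretrained on geometric prediction tasks, whereas the sketch falls back to a generic timm checkpoint.

```python
# Hypothetical sketch of a ViT-based NPHM parameter regressor (assumptions:
# backbone choice, latent dimensions, and head layout are NOT from the paper).
import torch
import torch.nn as nn
import timm  # assumes a timm-compatible ViT checkpoint is available


class NPHMRegressor(nn.Module):
    def __init__(self, backbone_name="vit_base_patch16_224",
                 id_dim=512, expr_dim=200):
        super().__init__()
        # Pretrained ViT backbone; num_classes=0 returns pooled features
        # instead of classification logits.
        self.backbone = timm.create_model(backbone_name, pretrained=True,
                                          num_classes=0)
        feat_dim = self.backbone.num_features
        # Separate linear heads for identity and expression latent codes.
        self.id_head = nn.Linear(feat_dim, id_dim)
        self.expr_head = nn.Linear(feat_dim, expr_dim)

    def forward(self, image):
        # image: (B, 3, 224, 224) -> pooled ViT feature -> NPHM latent codes
        feat = self.backbone(image)
        return self.id_head(feat), self.expr_head(feat)


# Usage example with a random image tensor.
model = NPHMRegressor()
z_id, z_expr = model(torch.randn(1, 3, 224, 224))
print(z_id.shape, z_expr.shape)  # torch.Size([1, 512]) torch.Size([1, 200])
```

In such a setup, the predicted latent codes would be decoded by a frozen NPHM decoder into an SDF, which can then be supervised directly in SDF space or refined at inference time against estimated surface normals.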