We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learned avatar is driven by a parametric face model, enabling user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometric prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve the synthesis of out-of-model expressions, we propose to predict local features anchored on the 3DMM geometry. These learned features are driven by the 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that applying a convolutional neural network in the UV space is critical for incorporating spatial context and producing representative local features. Extensive experiments show that we can reconstruct high-quality avatars with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
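The core idea of anchoring local features on the 3DMM geometry and interpolating them at a query point can be illustrated with a toy sketch. This is a minimal stand-in, not the paper's implementation: the mesh, feature dimensions, and the inverse-distance-weighted k-nearest-neighbor interpolation are all illustrative assumptions (the actual features would come from a UV-space CNN and feed a radiance field).

```python
import numpy as np

# Hypothetical toy setup (shapes and names are assumptions, not the paper's):
# deformed 3DMM vertex positions and per-vertex learned local features,
# e.g. as produced by a CNN operating in the UV space.
rng = np.random.default_rng(0)
vertices = rng.standard_normal((500, 3))   # deformed 3DMM vertex positions
features = rng.standard_normal((500, 16))  # local feature vectors anchored on the mesh

def interpolate_features(query, vertices, features, k=8, eps=1e-8):
    """Inverse-distance-weighted interpolation of the k nearest anchored
    features at a 3D query point -- one simple way to realize the
    'interpolated in 3D space' step described in the abstract."""
    d = np.linalg.norm(vertices - query, axis=1)  # distances to all anchors
    idx = np.argsort(d)[:k]                       # indices of the k nearest anchors
    w = 1.0 / (d[idx] + eps)                      # inverse-distance weights
    w /= w.sum()                                  # normalize to a convex combination
    return w @ features[idx]                      # interpolated feature at the query

query = np.array([0.1, -0.2, 0.3])  # a sample point along a camera ray
feat = interpolate_features(query, vertices, features)
print(feat.shape)  # the interpolated feature keeps the per-vertex dimensionality
```

In the full pipeline, such an interpolated feature (together with the query position and view direction) would be decoded by an MLP into density and color for volume rendering; since the anchors move with the 3DMM, the radiance field follows the tracked expression and pose.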