While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, their slow training and inference speeds severely obstruct practical use. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking-portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules prioritize efficiency while preserving rendering quality. Extensive experiments demonstrate that our method generates realistic, audio-lip-synchronized talking portrait videos while being far more efficient than previous methods.
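The grid decomposition described above can be sketched in a few lines: query a 3D spatial grid with point coordinates and a 2D grid with low-dimensional audio coordinates, then fuse the two features. The following minimal NumPy illustration uses nearest-neighbor lookup and hypothetical grid resolutions and feature dimensions purely for clarity; it is an assumption-laden sketch, not the authors' implementation (which would use interpolated, learnable grids inside a NeRF).

```python
import numpy as np

def grid_sample_3d(grid, coords):
    """Nearest-neighbor lookup in a 3D feature grid.
    grid: (R, R, R, F) array; coords: (N, 3) in [0, 1]."""
    R = grid.shape[0]
    idx = np.clip(np.round(coords * (R - 1)).astype(int), 0, R - 1)
    return grid[idx[:, 0], idx[:, 1], idx[:, 2]]

def grid_sample_2d(grid, coords):
    """Nearest-neighbor lookup in a 2D feature grid.
    grid: (R, R, F) array; coords: (N, 2) in [0, 1]."""
    R = grid.shape[0]
    idx = np.clip(np.round(coords * (R - 1)).astype(int), 0, R - 1)
    return grid[idx[:, 0], idx[:, 1]]

# Hypothetical resolutions / feature widths, chosen for illustration only.
spatial_grid = np.random.randn(64, 64, 64, 8)  # 3D spatial grid (head geometry)
audio_grid = np.random.randn(32, 32, 4)        # 2D audio grid (speech condition)

xyz = np.random.rand(5, 3)           # sampled 3D points along camera rays
audio_uv = np.random.rand(5, 2)      # low-dimensional audio coordinates

# Fused audio-spatial feature, which would feed a small MLP head.
feat = np.concatenate(
    [grid_sample_3d(spatial_grid, xyz),
     grid_sample_2d(audio_grid, audio_uv)],
    axis=-1)  # shape (5, 12)
```

The point of the decomposition is that querying two low-dimensional grids (one 3D, one 2D) is far cheaper than encoding a joint 4D-plus audio-spatial input, which is what enables the claimed real-time inference.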