Generating photo-realistic video portraits driven by arbitrary speech audio is a crucial problem in film-making and virtual reality. Recently, several works have explored the use of neural radiance fields (NeRF) in this task to improve 3D realism and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of their training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method that can produce natural results for various out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain-adaptive post-net to calibrate the result. Moreover, we learn a NeRF-based renderer conditioned on the predicted facial motion. A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.
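The abstract describes a three-stage pipeline: audio features are mapped to facial motion by a variational generator, calibrated by a domain-adaptive post-net, and rendered by a motion-conditioned NeRF. The following is a minimal structural sketch of that data flow; all function names, tensor shapes, and the toy computations inside each stage are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def variational_motion_generator(audio_feats, latent_dim=16, seed=0):
    """Map audio features to 3D facial landmarks by decoding a sampled
    latent code (a stand-in for the variational motion generator the
    abstract says is trained on a large lip-reading corpus)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_dim)  # latent sample ~ N(0, I)
    T = audio_feats.shape[0]
    # Toy linear decoder: (T, audio_dim + latent_dim) -> (T, 68, 3) landmarks
    W = rng.standard_normal((audio_feats.shape[1] + latent_dim, 68 * 3)) * 0.01
    cond = np.concatenate([audio_feats, np.tile(z, (T, 1))], axis=1)
    return (cond @ W).reshape(T, 68, 3)

def domain_adaptive_postnet(landmarks, scale=0.9, shift=0.0):
    """Calibrate generated motion toward the target speaker's domain
    (the abstract's post-net; here reduced to an affine correction)."""
    return landmarks * scale + shift

def nerf_renderer(landmarks, h=64, w=64):
    """Stand-in for the motion-conditioned NeRF renderer: one rendered
    frame per time step (blank placeholder pixels, purely illustrative)."""
    T = landmarks.shape[0]
    return np.zeros((T, h, w, 3))

# End-to-end data flow: audio -> motion -> calibrated motion -> frames.
audio = np.random.default_rng(1).standard_normal((25, 29))  # 25 frames of 29-dim features
motion = domain_adaptive_postnet(variational_motion_generator(audio))
frames = nerf_renderer(motion)
print(motion.shape, frames.shape)  # (25, 68, 3) (25, 64, 64, 3)
```

The sketch only fixes the interfaces between the stages; in the actual method each stage is a learned neural network, and the post-net is what lets a generator trained on a large out-of-domain corpus adapt to the small target-speaker video.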