Realistic generative face video synthesis has long been a pursuit in both the computer vision and graphics communities. However, existing face video generation methods tend to produce low-quality frames with drifting facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity, identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but its latent space is intrinsically entangled. By carefully examining its latent properties, we propose decomposition and recomposition designs that allow for the disentangled combination of facial appearance and movements. Moreover, a temporally dependent model is built upon the decomposed latent features and samples reasonable motion sequences, enabling the generation of realistic and temporally coherent face videos. In particular, our pipeline is trained with a joint training strategy on both static images and high-quality video data, which improves data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. Notably, StyleFaceV is capable of generating realistic $1024\times1024$ face videos even without high-resolution training videos.
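The decompose-and-recompose idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the latent dimensions, the split-based `decompose`/`recompose` functions, and the random-walk `sample_motion` stand in for the learned decomposition and the temporally dependent motion model, purely to show how a fixed appearance code combined with a varying pose code yields an identity-preserving frame sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent sizes; the real model uses learned, higher-dimensional codes.
APP_DIM, POSE_DIM = 8, 4

def decompose(w):
    """Split a latent code into appearance and pose parts (stand-in for the learned decomposition)."""
    return w[:APP_DIM], w[APP_DIM:]

def recompose(appearance, pose):
    """Recombine appearance and pose into a full latent code for the generator."""
    return np.concatenate([appearance, pose])

def sample_motion(pose0, num_frames, step=0.1):
    """Stand-in for the temporal model: a smooth random walk over pose codes."""
    poses = [pose0]
    for _ in range(num_frames - 1):
        poses.append(poses[-1] + step * rng.standard_normal(POSE_DIM))
    return poses

# One sampled latent -> fixed appearance + a sequence of poses -> per-frame latents.
w = rng.standard_normal(APP_DIM + POSE_DIM)
app, pose = decompose(w)
frame_latents = [recompose(app, p) for p in sample_motion(pose, num_frames=5)]
```

Because the appearance code is held fixed while only the pose code evolves, every frame latent shares an identical appearance part, which is the mechanism behind identity preservation across the generated video.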