The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Sets of high-quality facial videos are scarce, and working with videos introduces a fundamental barrier to overcome: temporal coherency. We propose that this barrier is largely artificial. The source video is already temporally coherent, and deviations from this state arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low-frequency functions, and demonstrate that together they provide a strongly consistent prior. We draw on these insights to propose a framework for semantic editing of faces in videos, demonstrating significant improvements over the current state of the art. Our method produces meaningful face manipulations, maintains a higher degree of temporal consistency, and can be applied to challenging, high-quality talking-head videos with which current methods struggle.
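To make the latent-space editing idea concrete, below is a minimal sketch (not the authors' implementation) of per-frame semantic editing in a StyleGAN W+ space: every frame is inverted to a latent code, and a single fixed semantic direction is added with the same strength across all frames, so the edit itself is temporally constant and cannot introduce frame-to-frame jitter. The inversion step, the `direction` vector, and the generator call are hypothetical placeholders for components such as an e4e/PTI-style inversion and an InterFaceGAN-style edit direction.

```python
import torch

T, L, D = 8, 18, 512  # frames in the clip, W+ layers, latent width

# Placeholder inversion: in practice an encoder (e.g., e4e, optionally
# refined with PTI) would map each aligned face crop to a W+ code.
latents = torch.randn(T, L, D)  # one latent code per video frame

# One fixed semantic direction (e.g., a "smile" direction), normalized and
# applied with the same strength alpha to every frame of the clip.
direction = torch.randn(1, L, D)
direction = direction / direction.norm()
alpha = 2.0
edited = latents + alpha * direction  # broadcast over all T frames

# A pretrained StyleGAN generator would then re-synthesize the clip:
# edited_frames = generator.synthesis(edited)  # hypothetical call
assert edited.shape == (T, L, D)
```

Because the direction and strength are shared by all frames, any residual flicker must come from the inversion and compositing stages, which is the sense in which the temporal-coherency barrier is largely artificial.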