Many recent methods edit face images by leveraging the latent space of pretrained GANs. However, few attempts have been made to apply them directly to videos, because 1) they do not guarantee temporal consistency, 2) their processing speed is too slow for video applications, and 3) they cannot accurately encode details of face motion and expression. To this end, we propose a novel network that encodes face videos into the latent space of StyleGAN for semantic face video manipulation. Built on a vision transformer, our network reuses the high-resolution portion of the latent vector to enforce temporal consistency. To capture subtle face motions and expressions, we design novel losses that involve sparse facial landmarks and a dense 3D face mesh. We have thoroughly evaluated our approach and successfully demonstrated its application to various face video manipulations. In particular, we propose a novel network for pose/expression control in a 3D coordinate system. Both qualitative and quantitative results show that our approach significantly outperforms existing single-image methods while achieving real-time (66 fps) speed.
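The idea of reusing the high-resolution portion of the latent vector can be illustrated with a minimal sketch. The layer split point, the W+ latent shape, and the `encode_frame` stand-in below are illustrative assumptions, not the paper's actual encoder: per-frame coarse codes (pose/expression) are kept, while the fine (high-resolution) codes from a reference frame are shared across all frames to suppress texture flicker.

```python
import numpy as np

NUM_LAYERS, DIM = 18, 512   # W+ latent shape for a 1024x1024 StyleGAN2 (assumption)
SPLIT = 8                   # layers >= SPLIT treated as "high-resolution" (assumption)

def encode_frame(frame_seed):
    # Hypothetical stand-in for the per-frame encoder; returns a W+ latent.
    rng = np.random.default_rng(frame_seed)
    return rng.standard_normal((NUM_LAYERS, DIM))

def encode_video(frame_seeds):
    """Encode each frame, then share one frame's high-resolution latent
    portion across the whole clip to enforce temporal consistency."""
    latents = [encode_frame(s) for s in frame_seeds]
    shared_high = latents[0][SPLIT:].copy()   # reuse the first frame's fine codes
    for w in latents:
        w[SPLIT:] = shared_high
    return latents
```

Under this split, only the coarse layers vary over time, so identity and fine texture stay fixed while motion-related codes evolve frame to frame.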