Image editing using a pretrained StyleGAN generator has emerged as a powerful paradigm for facial editing, providing disentangled control over age, expression, illumination, etc. However, the approach cannot be directly adapted for video manipulation. We hypothesize that the main missing ingredient is fine-grained and disentangled control over face location, face pose, and local facial expressions. In this work, we demonstrate that such fine-grained control is indeed achievable with a pretrained StyleGAN by working across multiple (latent) spaces (namely, the positional space, the W+ space, and the S space) and combining the optimization results across these spaces. Building on this enabling component, we introduce Video2StyleGAN, which takes a target image and driving video(s) and reenacts the local and global locations and expressions of the driving video in the identity of the target image. We evaluate the effectiveness of our method on multiple challenging scenarios and demonstrate clear improvements over alternative approaches.
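To make the multi-space idea concrete, the sketch below illustrates (in PyTorch) how edits recovered in three different StyleGAN latent spaces could be composed: a positional transform for face location, an early-layer W+ offset for coarse pose, and per-layer S-space offsets for local expressions. This is a hypothetical illustration only; the generator interface (`StubGenerator`, `wplus_to_styles`, `synthesize`) and all offsets are stand-ins, not the authors' released code or any specific library API.

```python
import torch
import torch.nn as nn

NUM_LAYERS, W_DIM = 18, 512  # typical StyleGAN2 configuration at 1024x1024


class StubGenerator(nn.Module):
    """Stand-in for a pretrained StyleGAN generator (assumed interface)."""

    def wplus_to_styles(self, w_plus):
        # In a real generator, per-layer affine maps would produce the
        # S-space style codes; here we just split W+ per layer.
        return [w_plus[:, i] for i in range(w_plus.shape[1])]

    def synthesize(self, styles, transform):
        # A real synthesis network would render an image, applying
        # `transform` (translation/rotation) to its positional input.
        return torch.zeros(1, 3, 1024, 1024)


def compose_edits(G, w_plus, d_pos, d_wplus, d_styles):
    """Apply pose (W+), expression (S), and location (positional) edits."""
    styles = G.wplus_to_styles(w_plus + d_wplus)          # coarse pose in W+
    styles = [s + ds for s, ds in zip(styles, d_styles)]  # local expressions in S
    return G.synthesize(styles, transform=d_pos)          # face location/rotation


G = StubGenerator()
w_plus = torch.randn(1, NUM_LAYERS, W_DIM)        # stand-in for the inverted target identity
d_pos = torch.tensor([[0.05, -0.02, 0.1]])        # dx, dy, rotation (assumed parameterization)
d_wplus = torch.zeros(1, NUM_LAYERS, W_DIM)
d_wplus[:, :6] = 0.05 * torch.randn(1, 6, W_DIM)  # pose edits confined to early layers
d_styles = [0.01 * torch.randn(1, W_DIM) for _ in range(NUM_LAYERS)]
frame = compose_edits(G, w_plus, d_pos, d_wplus, d_styles)
```

In an actual pipeline, each offset would be recovered by optimizing against the corresponding driving-video frame rather than sampled at random, and the composition would be repeated frame by frame.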