Inspired by the impressive performance of recent face image editing methods, several studies have naturally extended these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which remains unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that, for the first time among face video editing models, successfully extracts decomposed identity and motion features from a given video. This decomposition allows us to edit the video consistently by simply manipulating the temporally invariant feature in the desired direction. Another unique strength of our model is that, because it is based on diffusion models, it achieves both faithful reconstruction and strong editing capability at the same time, and it is robust to corner cases in in-the-wild face videos (e.g., occluded faces), unlike existing GAN-based methods.
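To make the editing idea concrete, the following is a minimal sketch, not the authors' implementation, of how a shared, temporally invariant feature could be edited once and reused across all frames; every name here (encode_identity, encode_motion, decode_frame, edit_direction) is a hypothetical placeholder for the corresponding component of a diffusion autoencoder.

```python
import torch

def edit_video(frames, encode_identity, encode_motion, decode_frame,
               edit_direction, strength=1.0):
    """frames: list of image tensors; returns a list of edited frames.

    Hypothetical sketch: the encoders/decoder are stand-ins for the
    identity/motion encoders and conditional diffusion decoder of a
    diffusion autoencoder.
    """
    # Per-frame features: the identity feature is assumed to be (nearly)
    # constant over time, while the motion feature captures the
    # frame-specific pose and expression.
    id_feats = [encode_identity(f) for f in frames]
    motion_feats = [encode_motion(f) for f in frames]

    # Use a single shared identity feature (here, the temporal average),
    # so that every frame receives exactly the same semantic edit,
    # which is what enforces temporal consistency.
    shared_id = torch.stack(id_feats).mean(dim=0)
    edited_id = shared_id + strength * edit_direction

    # Decode each frame from the edited identity and its own motion feature.
    return [decode_frame(edited_id, m) for m in motion_feats]
```

The key design point illustrated is that the edit is applied once to the shared identity feature rather than independently per frame, so frame-to-frame inconsistencies in the edit cannot arise.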