Inspired by the impressive performance of recent face image editing methods, several studies have naturally attempted to extend these methods to the face video editing task. One of the main challenges here, which remains unresolved, is temporal consistency among edited frames. To this end, we propose a novel face video editing framework based on diffusion autoencoders that, for the first time as a face video editing model, successfully extracts decomposed identity and motion features from a given video. This decomposition allows us to edit the video consistently by simply manipulating the temporally invariant identity feature toward the desired direction. Another unique strength of our model is that, because it is based on diffusion models, it satisfies both reconstruction and editing capabilities at the same time and, unlike existing GAN-based methods, is robust to corner cases in wild face videos (e.g., occluded faces).
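The following is a minimal sketch (not the authors' implementation) of the decomposed-feature editing idea described above: each frame is encoded into a per-frame motion feature and a temporally invariant identity feature shared across the video, and editing shifts only the shared identity feature so that all decoded frames change consistently. The names `encoder`, `decoder`, and `edit_direction` are hypothetical stand-ins for the diffusion autoencoder components and a learned semantic direction.

```python
import torch

def edit_video(frames, encoder, decoder, edit_direction, scale=1.0):
    """Sketch of consistent video editing via a shared, edited identity feature.

    frames: list of per-frame tensors
    encoder: hypothetical module mapping a frame -> (identity_feat, motion_feat)
    decoder: hypothetical module mapping (identity_feat, motion_feat) -> frame
    edit_direction: semantic direction in identity-feature space (assumed given)
    """
    # Encode every frame into an (identity, motion) feature pair.
    feats = [encoder(f) for f in frames]
    id_feats = torch.stack([identity for identity, _ in feats])

    # Temporally invariant identity: use one feature shared across the whole video.
    shared_id = id_feats.mean(dim=0)

    # Edit by moving the shared identity feature along the desired direction.
    edited_id = shared_id + scale * edit_direction

    # Decode each frame from its own motion feature and the single edited identity,
    # so the edit is applied identically to every frame.
    return [decoder(edited_id, motion) for _, motion in feats]
```

Because the edit touches only the single shared identity feature, temporal consistency follows by construction; the per-frame motion features are left untouched.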