Talking head generation aims to synthesize a video from a given source identity and a target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often exhibits unexpected deformation and severe distortions. Second, the driving image does not explicitly disentangle movement-relevant information, such as poses and expressions, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to have flickering artifacts due to the inconsistency of the extracted landmarks between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized talking head videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. More information is available at https://yuegao.me/PECHead.
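To make the feature-alignment idea concrete, the following is a minimal PyTorch sketch of one plausible realization: warping a multi-scale source feature pyramid with a dense motion field so that every scale follows the same driving motion. The function names, tensor shapes, and the use of a single sampling grid are illustrative assumptions and are not the paper's actual module.

```python
# Illustrative sketch only (assumed design, not the paper's exact module):
# align multi-scale source features to the driving motion by warping each
# pyramid level with a shared dense flow / sampling grid.
import torch
import torch.nn.functional as F


def warp_features(feat: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Warp one feature map with a sampling grid in normalized [-1, 1] coordinates.

    feat: (B, C, H, W) source features at one scale.
    grid: (B, h, w, 2) backward sampling grid, possibly at a different resolution.
    """
    b, _, h, w = feat.shape
    # Resize the grid to this scale's resolution before sampling.
    grid = F.interpolate(grid.permute(0, 3, 1, 2), size=(h, w),
                         mode="bilinear", align_corners=True).permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def align_multi_scale(feats: list[torch.Tensor], grid: torch.Tensor) -> list[torch.Tensor]:
    """Apply the same motion field to every scale of the source feature pyramid."""
    return [warp_features(f, grid) for f in feats]


if __name__ == "__main__":
    # Toy usage: a 3-level feature pyramid warped with an identity grid.
    feats = [torch.randn(1, 64, 64, 64),
             torch.randn(1, 128, 32, 32),
             torch.randn(1, 256, 16, 16)]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                            torch.linspace(-1, 1, 64), indexing="ij")
    identity_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, 64, 64, 2)
    aligned = align_multi_scale(feats, identity_grid)
    print([t.shape for t in aligned])
```

In practice the sampling grid would be predicted from the source and driving landmarks rather than fixed to identity; the sketch only shows how a single motion field can drive alignment at all feature scales.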