DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
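The decoding stage described above can be illustrated with a minimal sketch of deterministic DDIM sampling (eta = 0) conditioned on a per-frame latent. It is not the authors' implementation. The names `make_alpha_bar`, `ddim_decode`, and `toy_eps_model_factory` are hypothetical, as are the linear beta schedule and the 8x8 array size. The noise predictor is a toy oracle that treats the latent itself as the clean image, standing in for the trained DDIM decoder network. Reusing one starting noise `x_T` for every frame is likewise an assumption of this example, not something the abstract states.

```python
import numpy as np

def make_alpha_bar(num_steps: int) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_decode(x_T, z, eps_model, alpha_bar, timesteps):
    """Deterministic DDIM sampling (eta = 0), conditioned on latent z.

    timesteps must be in decreasing order; after the last one the sample
    is mapped to the clean image (alpha_bar treated as 1).
    """
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t, z)                                  # predicted noise
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean image
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

def toy_eps_model_factory(alpha_bar):
    """Hypothetical stand-in for a trained noise predictor.

    It is an oracle that assumes the clean image equals z, so decoding
    should recover z. A real decoder would be a neural network.
    """
    def eps_model(x_t, t, z):
        a_t = alpha_bar[t]
        return (x_t - np.sqrt(a_t) * z) / np.sqrt(1.0 - a_t)
    return eps_model

alpha_bar = make_alpha_bar(1000)
timesteps = list(range(999, -1, -100))        # 10 sampling steps: 999, 899, ..., 99
eps_model = toy_eps_model_factory(alpha_bar)

rng = np.random.default_rng(0)
x_T = rng.standard_normal((8, 8))             # one noise map shared by all frames (assumption)

# Stand-ins for the latents a speech2latent model would predict, one per frame.
latents = [np.full((8, 8), v) for v in (0.0, 0.1, 0.2)]

# Each frame is decoded independently; only the latent changes between frames.
frames = [ddim_decode(x_T, z, eps_model, alpha_bar, timesteps) for z in latents]
```

Because the sampler is deterministic, each output frame depends only on `x_T` and its latent. In this sketch, smoothly varying latents therefore yield smoothly varying frames without any model of the joint distribution over consecutive frames.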
