Talking head synthesis is a promising approach for the video production industry. Recently, considerable effort has been devoted to this research area to improve generation quality or enhance model generalization. However, few works are able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn our attention to the emerging, powerful Latent Diffusion Models, and model talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it naturally generalizes across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to \url{https://sstzal.github.io/DiffTalk/}.
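To make the conditioning mechanism described above concrete, the following is a minimal sketch, not the authors' implementation, of a denoiser that predicts the noise in a latent $z_t$ given the timestep together with audio, reference-image, and landmark features. The module name, feature dimensions, timestep embedding, and the simple MLP backbone are all illustrative assumptions; the actual DiffTalk network is a conditioned latent-diffusion U-Net.

```python
# Hypothetical sketch of audio/reference/landmark-conditioned denoising.
# All dimensions and the MLP backbone are assumptions for illustration only.
import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=32, ref_dim=64, lmk_dim=16, hidden=256):
        super().__init__()
        cond_dim = audio_dim + ref_dim + lmk_dim + 1  # +1 for a scalar timestep embedding
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),  # predict the noise added to z_t
        )

    def forward(self, z_t, t, audio_feat, ref_feat, lmk_feat):
        # Concatenate the driving audio with identity conditions (reference face,
        # landmarks) so each reverse-diffusion step is personality-aware.
        t_emb = t.float().unsqueeze(-1) / 1000.0
        cond = torch.cat([audio_feat, ref_feat, lmk_feat, t_emb], dim=-1)
        return self.net(torch.cat([z_t, cond], dim=-1))


if __name__ == "__main__":
    model = ConditionalDenoiser()
    z_t = torch.randn(2, 64)          # noisy face latents at step t
    t = torch.randint(0, 1000, (2,))  # diffusion timesteps
    audio = torch.randn(2, 32)        # audio window features
    ref = torch.randn(2, 64)          # reference face features
    lmk = torch.randn(2, 16)          # facial landmark features
    noise_pred = model(z_t, t, audio, ref, lmk)
    print(noise_pred.shape)           # torch.Size([2, 64])
```

Because the identity information enters only through the reference and landmark conditions, swapping these inputs at inference time is what allows synthesis for unseen identities without fine-tuning the denoiser itself.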