Unlike existing methods that rely on source images as appearance references and use source speech only to drive motion, this work proposes a novel approach that extracts all required information directly from speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage that combines a speech-conditioned diffusion model with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movements, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization with a region-enhancement module. To produce high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos from a single speech input alone.
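To make two of the components named above more concrete, the following is a minimal sketch, not the paper's released implementation: it assumes a `SampleAdaptiveWeighting` module that blends a statistical facial prior (here taken to be a mean face latent) with a speech-conditioned prediction, and a `region_enhanced_loss` that up-weights the mouth region to sharpen lip synchronization. All names, tensor shapes, and the choice of prior are illustrative assumptions.

```python
# Hypothetical sketch of a sample-adaptive weighting module and a
# region-enhanced reconstruction loss; shapes and names are assumptions,
# not the authors' code.

import torch
import torch.nn as nn


class SampleAdaptiveWeighting(nn.Module):
    """Blend a statistical facial prior with a speech-conditioned prediction
    using per-sample weights inferred from the speech embedding."""

    def __init__(self, speech_dim: int, latent_dim: int):
        super().__init__()
        # Statistical facial prior, e.g. the mean face latent of the training set.
        self.register_buffer("prior_latent", torch.zeros(latent_dim))
        self.gate = nn.Sequential(
            nn.Linear(speech_dim, latent_dim),
            nn.Sigmoid(),  # per-dimension weight in [0, 1]
        )

    def forward(self, speech_emb: torch.Tensor, pred_latent: torch.Tensor) -> torch.Tensor:
        w = self.gate(speech_emb)                   # (B, latent_dim)
        prior = self.prior_latent.unsqueeze(0)      # (1, latent_dim)
        return w * pred_latent + (1.0 - w) * prior  # sample-adaptive blend


def region_enhanced_loss(pred: torch.Tensor, target: torch.Tensor,
                         mouth_mask: torch.Tensor, mouth_weight: float = 5.0) -> torch.Tensor:
    """L1 reconstruction loss with the mouth region up-weighted for lip sync."""
    weight = 1.0 + (mouth_weight - 1.0) * mouth_mask  # 1 outside mouth, mouth_weight inside
    return (weight * (pred - target).abs()).mean()


if __name__ == "__main__":
    B, speech_dim, latent_dim = 2, 256, 128
    saw = SampleAdaptiveWeighting(speech_dim, latent_dim)
    latent = saw(torch.randn(B, speech_dim), torch.randn(B, latent_dim))

    frames_pred = torch.rand(B, 3, 64, 64)
    frames_gt = torch.rand(B, 3, 64, 64)
    mask = torch.zeros(B, 1, 64, 64)
    mask[:, :, 40:, 16:48] = 1.0  # crude lower-face / mouth region
    loss = region_enhanced_loss(frames_pred, frames_gt, mask)
    print(latent.shape, loss.item())
```

The gating design simply illustrates how a learned, per-sample trade-off between a data-driven prediction and a fixed statistical prior could be realized; the actual modules described in the abstract may differ in structure and training objective.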