We present LipDiffuser, a conditional diffusion model for lip-to-speech generation that synthesizes natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as the denoiser. To condition the model effectively, we incorporate visual features via magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition. These findings are further supported by a formal listening experiment.
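To make the conditioning mechanism concrete, below is a minimal sketch of what a magnitude-preserving FiLM layer could look like, assuming EDM2-style magnitude-preserving operations (per-output-channel weight normalization and a magnitude-preserving sum). The names `mp_linear`, `mp_sum`, and `MPFiLM`, the mixing coefficient `t`, and the exact way scale and shift are combined are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mp_sum(a: torch.Tensor, b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    # Magnitude-preserving sum: linear interpolation rescaled so the output
    # magnitude roughly matches that of the (assumed unit-magnitude) inputs.
    return a.lerp(b, t) / ((1 - t) ** 2 + t ** 2) ** 0.5


def mp_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Magnitude-preserving linear map: each output row is normalized to unit
    # norm and scaled by 1/sqrt(fan_in), keeping activation magnitudes stable.
    w = weight / weight.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return F.linear(x, w / weight.shape[1] ** 0.5)


class MPFiLM(nn.Module):
    """Hypothetical magnitude-preserving FiLM: per-frame visual (and speaker)
    conditioning features produce channel-wise scale and shift terms that
    modulate the denoiser activations, using magnitude-preserving projections
    and a magnitude-preserving sum instead of an unconstrained affine map."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.to_scale = nn.Parameter(torch.randn(channels, cond_dim))
        self.to_shift = nn.Parameter(torch.randn(channels, cond_dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, channels, time)  denoiser feature map
        # cond: (batch, time, cond_dim)  conditioning features per frame
        scale = mp_linear(cond, self.to_scale).transpose(1, 2)  # (B, C, T)
        shift = mp_linear(cond, self.to_shift).transpose(1, 2)  # (B, C, T)
        modulated = x * (1.0 + scale)          # FiLM-style channel scaling
        return mp_sum(modulated, shift, t=0.3)  # magnitude-preserving shift
```

As a usage sketch, `MPFiLM(channels=256, cond_dim=512)` would modulate a 256-channel denoiser block with 512-dimensional visual features; the magnitude-preserving formulation is assumed here to avoid the activation-scale drift that plain FiLM can introduce in deep diffusion denoisers.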