The task of Visual Text-to-Speech (VisualTTS), also known as video dubbing, aims to generate speech synchronized with the lip movements in an input video, in addition to being consistent with the content of the input text and cloning the timbre of a reference speech. Existing VisualTTS models typically adopt lightweight architectures and design specialized modules to achieve each of the above goals, yet the resulting speech quality is unsatisfactory due to limited model capacity and the scarcity of VisualTTS data. Recently, speech large language models (SpeechLLMs) have shown a strong ability to generate high-quality speech, but little work has been done to leverage temporal cues from video input for generating lip-synchronized speech. To generate speech that is both high-quality and lip-synchronized in VisualTTS tasks, we propose a novel Visual Speech Language Model, called VSpeechLM, built upon a SpeechLLM. To capture the synchronization relationship between text and video, we propose a text-video aligner: it first learns a fine-grained alignment between phonemes and lip movements, and then outputs an expanded phoneme sequence containing lip-synchronization cues. Our proposed SpeechLLM-based decoders then take the expanded phoneme sequence as input and learn to generate lip-synchronized speech. Extensive experiments demonstrate that VSpeechLM significantly outperforms previous VisualTTS methods in terms of overall quality, speaker similarity, and synchronization metrics.
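To make the text-video aligner's role concrete, below is a minimal, hypothetical sketch of the phoneme-expansion idea: phoneme embeddings attend over lip-movement features, a per-phoneme duration is predicted, and each phoneme is repeated accordingly so the expanded sequence carries lip-synchronization timing cues. All names, dimensions, and the cross-attention/duration-head design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a text-video aligner with phoneme expansion.
# Assumptions: cross-attention from phonemes to lip frames, a duration head,
# and repeat-based length regulation; not the paper's actual architecture.
import torch
import torch.nn as nn


class TextVideoAligner(nn.Module):
    def __init__(self, phone_dim: int = 256, video_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.phone_proj = nn.Linear(phone_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        # Cross-attention: phonemes (queries) attend to lip-movement frames (keys/values).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Predict a per-phoneme duration (in video frames) from the attended features.
        self.duration_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, phone_emb: torch.Tensor, lip_feats: torch.Tensor):
        """phone_emb: (B, P, phone_dim); lip_feats: (B, T, video_dim)."""
        q = self.phone_proj(phone_emb)
        kv = self.video_proj(lip_feats)
        attended, _ = self.cross_attn(q, kv, kv)                  # (B, P, hidden)
        durations = self.duration_head(attended).squeeze(-1).clamp(min=0.0)  # (B, P)
        return attended, durations


def expand_phonemes(feats: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme feature by its rounded duration, producing a
    frame-level sequence that encodes lip-synchronization timing."""
    reps = durations.round().long().clamp(min=1)                  # at least one frame per phoneme
    expanded = [f.repeat_interleave(r, dim=0) for f, r in zip(feats, reps)]
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)  # (B, T', hidden)


if __name__ == "__main__":
    aligner = TextVideoAligner()
    phones = torch.randn(2, 10, 256)   # batch of phoneme embeddings
    lips = torch.randn(2, 75, 512)     # lip-movement features, e.g. 3 s at 25 fps
    attended, durs = aligner(phones, lips)
    expanded = expand_phonemes(attended, durs)
    print(expanded.shape)              # frame-level sequence fed to the SpeechLLM-based decoders
```

In such a design, the expanded sequence gives the downstream decoders an explicit, frame-level timing signal, which is one plausible way to inject video-derived synchronization cues into a SpeechLLM.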