Dubbing is a post-production process of re-recording actors' dialogue, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, in synchronization with the pre-recorded video. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with a given video from text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the single-speaker chemistry lecture dataset and the multi-speaker LRS2 dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of the synthesized speech via the video, and generate high-fidelity speech temporally synchronized with the video.