Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference. V2C is more challenging than conventional text-to-speech tasks because it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, which bridges visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations inspired by recent psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings are used together to generate a mel-spectrogram, which is then converted to speech waveforms via an existing vocoder. Extensive experimental results on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method. The source code and trained models will be released to the public.
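To make the face-to-prosody step more concrete, the following is a minimal sketch of how an attention mechanism could let phoneme-level text embeddings attend over frame-level valence and arousal features. It is an illustration under stated assumptions, not the proposed architecture: the function names, the dimensions, the random inputs, and in particular the pairing of arousal with energy and valence with pitch are hypothetical choices made for the example, and plain scaled dot-product attention stands in for whatever attention variant the method actually uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights


# Hypothetical sizes and random stand-in features (assumptions for illustration).
rng = np.random.default_rng(0)
n_phonemes, n_frames, dim = 6, 20, 16
phoneme_emb = rng.normal(size=(n_phonemes, dim))   # text-side queries
arousal_feat = rng.normal(size=(n_frames, dim))    # per-frame arousal features
valence_feat = rng.normal(size=(n_frames, dim))    # per-frame valence features

# Assumed pairing: arousal context feeds an energy predictor,
# valence context feeds a pitch predictor.
energy_ctx, energy_w = cross_attention(phoneme_emb, arousal_feat, arousal_feat)
pitch_ctx, pitch_w = cross_attention(phoneme_emb, valence_feat, valence_feat)

print(energy_ctx.shape, pitch_ctx.shape)
```

Each phoneme ends up with one context vector per affective stream, which a downstream predictor could map to a scalar energy or pitch value. Duration alignment from lip movement and the scene-level emotion booster are separate components and are not covered by this sketch.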