Neural Dubber: Dubbing for Videos According to Scripts

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
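The abstract requires the generated speech to be temporally synchronized with the video but does not describe the mechanism. As a rough illustration of the length constraint this implies, the sketch below computes how many mel-spectrogram frames would cover a video clip and maps each mel frame back to a video frame. The function names, the 25 fps frame rate, the 16 kHz sample rate, and the hop length of 160 are all assumptions made for this example; they are not taken from the paper.

```python
# Illustrative sketch only: names and default values are assumptions,
# not details reported for Neural Dubber.

def target_mel_frames(n_video_frames: int, fps: float = 25.0,
                      sample_rate: int = 16000, hop_length: int = 160) -> int:
    """Number of mel frames needed for the audio to last as long as the video."""
    duration_s = n_video_frames / fps               # clip duration in seconds
    return round(duration_s * sample_rate / hop_length)


def mel_to_video_index(n_video_frames: int, n_mel_frames: int) -> list[int]:
    """For each mel frame, the index of the video frame shown at that time."""
    return [
        min(m * n_video_frames // n_mel_frames, n_video_frames - 1)
        for m in range(n_mel_frames)
    ]


if __name__ == "__main__":
    n_mel = target_mel_frames(75)                   # a 3-second clip at 25 fps
    print(n_mel)
    print(mel_to_video_index(3, 12))
```

With these assumed settings each video frame spans four mel frames, so a model that conditions on lip movement would need its output length tied to the video length in some comparable way.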


Related content

Speech synthesis


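Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can fairly be said to be affecting every aspect of people's lives.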
