In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary, and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, the key one being that many features of the desired target speech, such as voice, pitch, and linguistic content, cannot be entirely inferred from the silent face video. To handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baselines by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve performance comparable to single-speaker models trained on $4\times$ more data. We conduct numerous ablation studies to analyze the effect of the different modules of our architecture. We also provide a demo video demonstrating several qualitative results, along with the code and trained models, on our website: \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/lip-to-speech-synthesis}
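Since the abstract only names the high-level design (a VAE-style generator conditioned on the lip video, trained adversarially against one or more discriminators), the following PyTorch sketch illustrates how such a lip-to-speech VAE-GAN could be wired together. All module names (\texttt{LipEncoder}, \texttt{SpeechGenerator}, \texttt{MelDiscriminator}), tensor shapes, and loss weights are illustrative assumptions for exposition, not the paper's actual architecture.

\begin{verbatim}
# Hypothetical sketch of a VAE-GAN lip-to-speech model. All names,
# shapes, and hyper-parameters are illustrative assumptions, not the
# architecture described in the paper.
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """3D-CNN over a silent lip-video clip -> per-clip feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse (T, H, W)
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, video):             # video: (B, 3, T, H, W)
        h = self.conv(video).flatten(1)    # (B, 64)
        return self.proj(h)                # (B, feat_dim)

class SpeechGenerator(nn.Module):
    """VAE-style generator: lip features -> (mu, logvar) -> mel-spectrogram.
    The latent z models speech variation (e.g. voice, pitch) that the
    silent video does not determine."""
    def __init__(self, feat_dim=256, z_dim=64, n_mels=80, n_frames=100):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, z_dim)
        self.to_logvar = nn.Linear(feat_dim, z_dim)
        self.decode = nn.Sequential(
            nn.Linear(feat_dim + z_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels * n_frames),
        )
        self.n_mels, self.n_frames = n_mels, n_frames

    def forward(self, lip_feat):
        mu, logvar = self.to_mu(lip_feat), self.to_logvar(lip_feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        mel = self.decode(torch.cat([lip_feat, z], dim=-1))
        return mel.view(-1, self.n_mels, self.n_frames), mu, logvar

class MelDiscriminator(nn.Module):
    """One of possibly several discriminators judging real vs. generated mels."""
    def __init__(self, n_mels=80, n_frames=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, mel):
        return self.net(mel)

# One generator training step: reconstruction + KL + adversarial terms.
enc, gen, disc = LipEncoder(), SpeechGenerator(), MelDiscriminator()
video = torch.randn(2, 3, 25, 64, 64)     # (B, C, T, H, W) lip clip
real_mel = torch.randn(2, 80, 100)        # paired ground-truth mel
mel, mu, logvar = gen(enc(video))
recon = nn.functional.l1_loss(mel, real_mel)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
adv = nn.functional.binary_cross_entropy_with_logits(
    disc(mel), torch.ones(2, 1))          # generator tries to fool disc
loss = recon + 1e-3 * kl + 1e-2 * adv     # weights are illustrative
loss.backward()
\end{verbatim}

In a full training loop the discriminator(s) would be updated in alternation with the generator on real/generated pairs; the sketch shows only the generator-side objective to make the VAE-GAN decomposition (reconstruction, KL, adversarial) explicit.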