Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for LRW (Lip Reading in the Wild), featuring hundreds of speakers recorded entirely `in the wild'. We evaluate the generated samples in two different scenarios -- seen and unseen speakers -- using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.
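The generator objective described above (adversarial terms from the waveform and power critics, plus three comparative losses tying the output to the input video) can be sketched as follows. The least-squares form of the adversarial terms, the L1 form of the comparative terms, the choice of comparative targets (waveform, spectral features, perceptual features), and all weight names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the generator's total loss. The LSGAN-style adversarial
# terms and L1 comparative terms are assumptions for illustration only.

def ls_adv_loss(critic_score_fake):
    # Generator's adversarial term for one critic (least-squares GAN form):
    # push the critic's score on generated audio toward the "real" label 1.
    return (critic_score_fake - 1.0) ** 2

def l1(a, b):
    # Mean absolute error between two equal-length sequences.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def generator_loss(wave_score, power_score,
                   fake_wave, real_wave,
                   fake_spec, real_spec,
                   fake_feat, real_feat,
                   w_adv=1.0, w_wave=1.0, w_spec=1.0, w_feat=1.0):
    """Total generator objective: adversarial losses from the waveform
    critic and the power critic, plus three comparative losses
    (hypothetical targets: raw waveform, spectral features, perceptual
    features) that enforce correspondence with the input video."""
    adv = ls_adv_loss(wave_score) + ls_adv_loss(power_score)
    comparative = (w_wave * l1(fake_wave, real_wave)
                   + w_spec * l1(fake_spec, real_spec)
                   + w_feat * l1(fake_feat, real_feat))
    return w_adv * adv + comparative
```

When the critics score the generated audio as real (score 1.0) and the generated signal matches the target exactly, every term vanishes and the loss is zero; any mismatch in realism or correspondence raises it.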