Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results on GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
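A minimal sketch of the two-stage pipeline described above, assuming a PyTorch setting: a feedforward video-to-spectrogram predictor followed by a separate, pre-trained vocoder. The module names, layer sizes, the mel/video frame ratio, and the `vocode` stub are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class VideoToSpectrogram(nn.Module):
    """Feedforward predictor mapping a silent lip-region video clip to a
    mel-frequency spectrogram (hypothetical layer sizes)."""

    def __init__(self, n_mels: int = 80, mel_frames_per_video_frame: int = 4):
        super().__init__()
        # 3D convolutional frontend over (time, height, width) of the mouth crop.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool away space
        )
        # Feedforward head: each video frame predicts several spectrogram frames,
        # since audio frames are denser in time than video frames.
        self.head = nn.Sequential(
            nn.Linear(64, 256),
            nn.ReLU(),
            nn.Linear(256, n_mels * mel_frames_per_video_frame),
        )
        self.n_mels = n_mels
        self.r = mel_frames_per_video_frame

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, T_video, H, W) grayscale mouth crops
        feats = self.frontend(video).squeeze(-1).squeeze(-1)   # (B, 64, T_video)
        feats = feats.transpose(1, 2)                          # (B, T_video, 64)
        mel = self.head(feats)                                 # (B, T_video, n_mels * r)
        return mel.reshape(video.size(0), -1, self.n_mels)     # (B, T_video * r, n_mels)


def vocode(mel: torch.Tensor) -> torch.Tensor:
    """Placeholder for the pre-trained neural vocoder that converts
    mel-frequency spectrograms into waveform audio."""
    raise NotImplementedError("Plug in a pre-trained neural vocoder here.")


if __name__ == "__main__":
    model = VideoToSpectrogram()
    dummy_clip = torch.randn(2, 1, 25, 88, 88)  # one second of video at 25 fps
    mel = model(dummy_clip)
    print(mel.shape)  # torch.Size([2, 100, 80])
```

Keeping the predictor purely feedforward and delegating waveform generation to an off-the-shelf vocoder is what makes the approach easy to scale to large, unconstrained datasets.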