The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences to mel-scale spectrograms directly, without requiring any human annotations. The proposed LipSound2 model is first pre-trained on $\sim$2400h of multi-lingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve significant improvements in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the transferability of the approach. Lastly, we build a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
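To make the video-to-spectrogram mapping concrete, the following is a minimal sketch of an encoder-decoder with location-aware attention of the kind described above. It assumes pre-extracted per-frame visual features and Tacotron 2-style location-sensitive attention; all layer sizes, class names, and the `face_feats` input format are hypothetical illustrations, not the paper's exact architecture.

```python
# Hypothetical sketch of a LipSound2-style video-to-mel-spectrogram model.
# Encoder: BiLSTM over face-frame features; decoder: autoregressive LSTM with
# location-aware attention predicting mel frames. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareAttention(nn.Module):
    """Content-based attention augmented with features of past alignments."""
    def __init__(self, enc_dim, dec_dim, attn_dim=128, conv_ch=32, kernel=31):
        super().__init__()
        self.query = nn.Linear(dec_dim, attn_dim, bias=False)
        self.memory = nn.Linear(enc_dim, attn_dim, bias=False)
        self.location = nn.Conv1d(1, conv_ch, kernel, padding=kernel // 2, bias=False)
        self.loc_proj = nn.Linear(conv_ch, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_out, prev_align):
        # prev_align: (B, T_enc) cumulative attention weights from earlier steps
        loc = self.loc_proj(self.location(prev_align.unsqueeze(1)).transpose(1, 2))
        energy = self.score(torch.tanh(
            self.query(dec_state).unsqueeze(1) + self.memory(enc_out) + loc)).squeeze(-1)
        align = F.softmax(energy, dim=-1)                       # (B, T_enc)
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)
        return context, align


class LipSound2Sketch(nn.Module):
    def __init__(self, img_feat=512, enc_dim=256, dec_dim=512, n_mels=80):
        super().__init__()
        self.dec_dim, self.n_mels = dec_dim, n_mels
        self.encoder = nn.LSTM(img_feat, enc_dim // 2, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.attn = LocationAwareAttention(enc_dim, dec_dim)
        self.decoder = nn.LSTMCell(enc_dim + n_mels, dec_dim)
        self.mel_out = nn.Linear(dec_dim + enc_dim, n_mels)

    def forward(self, face_feats, n_frames):
        # face_feats: (B, T_enc, img_feat) pre-extracted visual features
        enc_out, _ = self.encoder(face_feats)
        B, T_enc, _ = enc_out.shape
        h = enc_out.new_zeros(B, self.dec_dim)
        c = enc_out.new_zeros(B, self.dec_dim)
        align_cum = enc_out.new_zeros(B, T_enc)
        prev_mel = enc_out.new_zeros(B, self.n_mels)
        mels = []
        for _ in range(n_frames):
            context, align = self.attn(h, enc_out, align_cum)
            align_cum = align_cum + align
            h, c = self.decoder(torch.cat([context, prev_mel], dim=-1), (h, c))
            prev_mel = self.mel_out(torch.cat([h, context], dim=-1))
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)                          # (B, n_frames, n_mels)


if __name__ == "__main__":
    model = LipSound2Sketch()
    video = torch.randn(2, 75, 512)          # e.g. 75 face frames, 512-d CNN features
    print(model(video, n_frames=300).shape)  # torch.Size([2, 300, 80])
```

The cumulative alignment term lets each decoder step know which video frames have already been attended to, which encourages the monotonic video-to-speech alignment the task requires; a neural vocoder would then convert the predicted mel-spectrogram to a waveform.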