Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones being added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.
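The first stage above predicts mel-spectrograms as an intermediate representation before vocoding. As background, the mel scale that underlies this representation is a standard perceptual frequency warping; the minimal sketch below (using the common HTK-style formula, not code from LA-VocE itself) illustrates the Hz-to-mel mapping that mel filterbanks are built on.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel value back to frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mel scale is roughly linear below 1 kHz and logarithmic above,
# approximating human pitch perception -- one reason mel-spectrograms
# are a common intermediate target for neural vocoders such as HiFi-GAN.
print(round(hz_to_mel(1000.0), 1))  # ~1000 mel at 1 kHz by construction
```

A mel-spectrogram is then obtained by applying a bank of triangular filters, spaced evenly on this mel axis, to the magnitude spectrogram of the signal; the vocoder learns the inverse mapping from that compressed representation back to a waveform.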