Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to improve certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. There, too, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
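The two-step structure described above (P-AVSR producing discrete units, P-TTS consuming them) can be illustrated with a toy sketch. This is not the ReVISE implementation: the abstract does not say how units are derived, so the sketch assumes nearest-centroid quantization against a small codebook, and the names `nearest_unit`, `quantize`, `resynthesize`, and `toy_tts` are hypothetical. In the actual system both stages would be learned neural models.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def nearest_unit(frame: Frame, codebook: Sequence[Frame]) -> int:
    """Index of the codebook centroid closest to `frame` (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map each feature frame to a discrete unit ID (assumed scheme)."""
    return [nearest_unit(f, codebook) for f in frames]


def resynthesize(
    av_input: Sequence[Frame],
    p_avsr: Callable[[Sequence[Frame]], List[int]],
    p_tts: Callable[[List[int]], List[float]],
) -> List[float]:
    """Two-step resynthesis: input -> discrete units -> output samples."""
    units = p_avsr(av_input)  # stage 1: stand-in for pseudo AVSR
    return p_tts(units)       # stage 2: stand-in for pseudo TTS


# Toy stand-ins (assumptions, for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]


def toy_tts(units: List[int]) -> List[float]:
    # Emits one placeholder "sample" per unit; a real P-TTS is a vocoder-like model.
    return [float(u) for u in units]


units = quantize(frames, codebook)
output = resynthesize(frames, lambda x: quantize(x, codebook), toy_tts)
```

The point of the sketch is only the interface: the two stages communicate exclusively through a sequence of integer unit IDs, so either stage can be swapped out independently.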