Deep latent variable generative models based on the variational autoencoder (VAE) have shown promising performance for audiovisual speech enhancement (AVSE). The underlying idea is to learn a VAE-based audiovisual prior distribution for clean speech data and then combine it with a statistical noise model to recover a speech signal from a noisy audio recording and video (lip images) of the target speaker. Existing generative models developed for AVSE do not take into account the sequential nature of speech data, which prevents them from fully exploiting the power of visual data. In this paper, we present an audiovisual deep Kalman filter (AV-DKF) generative model that assumes a first-order Markov chain model for the latent variables and effectively fuses audiovisual data. Moreover, we develop an efficient inference methodology to estimate speech signals at test time. We conduct a set of experiments to compare different variants of generative models for speech enhancement. The results demonstrate the superiority of the AV-DKF model over both its audio-only version and the non-sequential audio-only and audiovisual VAE-based models.
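As a minimal sketch of the first-order Markov assumption stated above, the AV-DKF prior over latent states $z_{1:T}$ and clean-speech frames $s_{1:T}$, conditioned on visual features $v_{1:T}$ (lip-image embeddings), can be written as follows; the exact placement of the visual conditioning (in the transition, the emission, or both) is an assumption for illustration and is not fixed by the abstract:
$$
p(s_{1:T}, z_{1:T} \mid v_{1:T}) \;=\; p(z_1 \mid v_1)\, p(s_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, v_t)\, p(s_t \mid z_t),
$$
where $z_t$ is the latent state, $s_t$ the clean-speech frame, and $v_t$ the visual feature at time $t$. The first-order Markov transition $p(z_t \mid z_{t-1}, v_t)$ is what distinguishes this sequential model from the non-sequential VAE priors, whose latent variables are independent across frames.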