Recently, audio-visual speech enhancement has been tackled in an unsupervised setting based on variational auto-encoders (VAEs): during training, only clean data are used to learn a generative model for speech, which at test time is combined with a noise model, e.g. nonnegative matrix factorization (NMF), whose parameters are estimated without supervision. Consequently, the proposed model is agnostic to the noise type. When the visual data are clean, audio-visual VAE-based architectures usually outperform their audio-only counterparts. The opposite happens when the visual data are corrupted by clutter, e.g. when the speaker is not facing the camera. In this paper, we propose to find the optimal combination of these two architectures over time. More precisely, we introduce a latent sequential variable with Markovian dependencies that switches between different VAE architectures over time in an unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We propose a variational factorization to approximate the computationally intractable posterior distribution, and we derive the corresponding variational expectation-maximization algorithm to estimate the model parameters and enhance the speech signal. Our experiments demonstrate the promising performance of SwVAE.
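To make the switching mechanism concrete, here is a minimal sketch in our own notation (the abstract fixes no symbols; the Gaussian decoder form is an assumption that is standard in VAE-based speech enhancement, not a claim about the paper's exact model). A discrete state $s_t$ follows a first-order Markov chain and selects which VAE decoder generates the clean-speech frame $\mathbf{x}_t$ from the latent $\mathbf{z}_t$:
$$
p(s_{1:T}) = p(s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1}), \qquad s_t \in \{\text{audio-only},\, \text{audio-visual}\},
$$
$$
p(\mathbf{x}_t \mid \mathbf{z}_t, s_t = k) = \mathcal{N}\!\big(\mathbf{x}_t;\, \boldsymbol{\mu}_k(\mathbf{z}_t),\, \operatorname{diag}(\mathbf{v}_k(\mathbf{z}_t))\big),
$$
where $(\boldsymbol{\mu}_k, \mathbf{v}_k)$ denotes the decoder of architecture $k$, the audio-visual decoder additionally conditioning on a visual embedding. The joint posterior over $(s_{1:T}, \mathbf{z}_{1:T})$ given the noisy observations is intractable, which is what motivates the factorized variational approximation and the variational EM algorithm mentioned above.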