Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to time-series data modeling. DVAEs can be considered extensions of the variational autoencoder (VAE) that model temporal dependencies between successive observed and/or latent vectors in data sequences. Previous work has demonstrated the value of DVAEs and their better performance over the VAE for modeling speech signals (spectrograms). Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic setup that requires only clean speech signals for training rather than a parallel dataset of clean and noisy speech samples. In this paper, we extend those works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning of speech signals and dynamics modeling. We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs, which we then adapt to three specific DVAE models to illustrate the versatility of the framework. More precisely, we combine DVAE-based speech priors with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. Experimental results show that the proposed DVAE-based approach outperforms its VAE counterpart and a supervised speech enhancement baseline.
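
To make the combination of a learned speech prior with a nonnegative matrix factorization (NMF) noise model more concrete, here is a minimal sketch of the kind of observation model commonly used in VAE-based unsupervised speech enhancement. The abstract does not state these equations. The symbols ($x_{fn}$, $s_{fn}$, $b_{fn}$, $\mathbf{z}$, $v_{s,fn}$, $\mathbf{W}$, $\mathbf{H}$, $q$) and the Wiener-like estimator are assumptions introduced for illustration only, with $f$ indexing frequency bins and $n$ indexing time frames of the short-time Fourier transform.

```latex
\begin{align}
  % Noisy mixture: clean speech plus additive noise in the STFT domain (assumed)
  x_{fn} &= s_{fn} + b_{fn}, \\
  % Speech prior: zero-mean complex Gaussian whose variance is produced by the
  % (D)VAE decoder from the latent sequence z (assumed form)
  s_{fn} \mid \mathbf{z} &\sim \mathcal{N}_c\!\big(0,\; v_{s,fn}(\mathbf{z})\big), \\
  % Noise model: variance given by a rank-K NMF (assumed form)
  b_{fn} &\sim \mathcal{N}_c\!\big(0,\; (\mathbf{W}\mathbf{H})_{fn}\big),
  \qquad \mathbf{W} \in \mathbb{R}_{+}^{F \times K},\;
         \mathbf{H} \in \mathbb{R}_{+}^{K \times N}, \\
  % Speech estimate: Wiener-like gain averaged over an approximate posterior q(z)
  \hat{s}_{fn} &= \mathbb{E}_{q(\mathbf{z})}\!\left[
      \frac{v_{s,fn}(\mathbf{z})}{v_{s,fn}(\mathbf{z}) + (\mathbf{W}\mathbf{H})_{fn}}
  \right] x_{fn}.
\end{align}
```

Under these assumptions, a VEM procedure would plausibly alternate between refining the approximate posterior $q(\mathbf{z})$ over the latent sequence (E-step) and updating the NMF parameters $\mathbf{W}$ and $\mathbf{H}$ (M-step), with the speech model trained beforehand on clean speech only.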