Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for modeling speech spectrograms. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic setup that requires neither noise samples nor noisy speech samples at training time, only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning and dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
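To make the noise model concrete, the following is a minimal NumPy sketch of how a nonnegative matrix factorization (NMF) noise variance can be fitted alongside a fixed speech variance and then used for Wiener-like filtering. It is not the VEM algorithm derived in the paper: the array `speech_var` is a hypothetical stand-in for the variance that a pre-trained DVAE decoder would output, the latent-variable inference step is omitted entirely, and the function names, dimensions, and synthetic data are assumptions chosen for illustration. The updates are the standard multiplicative rules for the Itakura-Saito divergence with an exponent of 1/2.

```python
import numpy as np


def is_divergence(power, variance):
    """Itakura-Saito divergence between a power spectrogram and a variance model."""
    ratio = power / variance
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def update_nmf_noise(power, speech_var, W, H, n_iter=50):
    """Fit the noise variance W @ H while the speech variance stays fixed.

    Assumption: in a full VEM algorithm, speech_var would be re-estimated
    from the DVAE at every iteration; here it is held constant.
    """
    for _ in range(n_iter):
        V = speech_var + W @ H  # total variance of the noisy mixture
        H = H * np.sqrt((W.T @ (power / V**2)) / (W.T @ (1.0 / V)))
        V = speech_var + W @ H
        W = W * np.sqrt(((power / V**2) @ H.T) / ((1.0 / V) @ H.T))
    return W, H


def wiener_gain(speech_var, W, H):
    """Per-bin gain: speech variance over total (speech + noise) variance."""
    return speech_var / (speech_var + W @ H)


# Synthetic example; all sizes and distributions are illustrative assumptions.
rng = np.random.default_rng(0)
F, N, K = 64, 40, 4  # frequency bins, time frames, NMF rank

speech_var = rng.gamma(2.0, 1.0, size=(F, N)) + 1e-3  # stand-in for DVAE output
true_noise_var = rng.random((F, K)) @ rng.random((K, N)) + 1e-3

# Noisy STFT drawn from a zero-mean complex Gaussian with the summed variance.
total_var = speech_var + true_noise_var
x = np.sqrt(total_var / 2.0) * (
    rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))
)
power = np.abs(x) ** 2 + 1e-12

# Random positive initialization of the NMF factors.
W = rng.random((F, K)) + 0.1
H = rng.random((K, N)) + 0.1

div_before = is_divergence(power, speech_var + W @ H)
W, H = update_nmf_noise(power, speech_var, W, H)
div_after = is_divergence(power, speech_var + W @ H)

gain = wiener_gain(speech_var, W, H)
enhanced = gain * x  # estimate of the clean speech STFT coefficients
```

In this sketch the gain lies between 0 and 1 in every time-frequency bin, so the enhanced STFT attenuates bins where the fitted noise variance dominates. Because only the noise factors `W` and `H` are adapted on the noisy signal, no noise samples are needed beforehand, which mirrors the noise-agnostic setup described above.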