Unsupervised speech enhancement based on variational autoencoders (VAEs) has shown promising performance compared with commonly used supervised methods. The approach combines a pre-trained deep speech prior with a parametric noise model whose parameters are learned from the noisy speech signal using an expectation-maximization (EM) method. The E-step involves an intractable posterior distribution over the latent variables. Existing algorithms for this step rely either on computationally heavy Markov chain Monte Carlo (MCMC) sampling or variational inference, or on inefficient optimization-based methods. In this paper, we propose a new approach based on Langevin dynamics that generates multiple sequences of samples and includes a total variation-based regularization to incorporate temporal correlations of the latent vectors. Our experiments demonstrate that the proposed framework offers an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.
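As a rough illustration of the sampling idea described above, the following Python sketch runs several unadjusted Langevin chains in parallel over a sequence of latent vectors and adds a total variation (TV) penalty along the time axis. It is not the paper's implementation: the function names (`tv_subgradient`, `langevin_chains`), the step size, the TV weight, and the toy Gaussian target are all assumptions made for this example. In the actual method the gradient would presumably come from the VAE-based posterior of the latent variables given the noisy speech.

```python
import numpy as np

def tv_subgradient(z):
    """Subgradient of sum_t |z[t+1] - z[t]| (elementwise L1) w.r.t. z.

    z has shape (..., T, D): any leading chain axes, T time frames, D latent dims.
    """
    s = np.sign(np.diff(z, axis=-2))  # sign of frame-to-frame differences, shape (..., T-1, D)
    g = np.zeros_like(z)
    g[..., :-1, :] -= s               # contribution of |z[t+1] - z[t]| to frame t
    g[..., 1:, :] += s                # contribution of |z[t+1] - z[t]| to frame t+1
    return g

def langevin_chains(grad_log_post, z0, n_steps=200, step=1e-2, tv_weight=0.1, seed=0):
    """Run unadjusted Langevin dynamics on all chains in z0 at once.

    grad_log_post: callable returning the gradient of the log posterior (hypothetical
                   stand-in for the model-based gradient), same shape as its input.
    z0:            initial latents, shape (n_chains, T, D).
    The TV term is subtracted from the log posterior, so its subgradient is
    subtracted from the drift.
    """
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        drift = grad_log_post(z) - tv_weight * tv_subgradient(z)
        noise = rng.standard_normal(z.shape)
        z = z + 0.5 * step * drift + np.sqrt(step) * noise
    return z

# Toy usage: a unit-variance Gaussian target centred at 1 (assumption, for illustration only).
mu = 1.0
toy_grad = lambda z: -(z - mu)
samples = langevin_chains(toy_grad, np.zeros((8, 20, 4)), n_steps=1000)
print(samples.shape, float(samples.mean()))
```

Running several chains at once only adds a leading axis to the array, so the per-step cost stays a single gradient evaluation on a batch. The TV weight controls how strongly neighbouring frames are pulled towards each other; setting it to zero recovers plain Langevin sampling with independent frames under this toy target.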