Speech enhancement involves separating a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarce, despite their success in related fields. Thus, in this paper we propose an NF framework that directly models the enhancement process by density estimation of clean speech utterances conditioned on their noisy counterparts. The WaveGlow model from speech synthesis is adapted to enable direct enhancement of noisy utterances in the time domain. In addition, we demonstrate that nonlinear input companding benefits model performance by equalizing the distribution of input samples. Experimental evaluation on a publicly available dataset shows results comparable to current state-of-the-art GAN-based approaches, while surpassing the chosen baselines on objective evaluation metrics.
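As an illustration of what nonlinear input companding can look like, the following minimal sketch applies a mu-law curve to time-domain samples. The abstract does not name the companding function or its parameters, so the choice of mu-law, the constant `MU = 255`, and the helper names `mu_law_compress` and `mu_law_expand` are assumptions made for this example only.

```python
import math

# Assumed companding constant (the common 8-bit mu-law convention);
# the abstract does not specify a value.
MU = 255.0


def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] through the mu-law curve (hypothetical helper)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress to recover the original sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


# Speech samples cluster near zero; companding spreads small amplitudes
# over a wider range before they reach the model.
samples = [-0.5, -0.01, 0.0, 0.01, 0.5]
companded = [mu_law_compress(s) for s in samples]
restored = [mu_law_expand(c) for c in companded]
```

Because the mapping is invertible, a model trained on companded samples can have its output expanded back to the original amplitude scale. Whether the authors apply the inverse in exactly this way is not stated in the abstract.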