Although deep neural network (DNN)-based speech enhancement (SE) methods outperform previous non-DNN-based ones, they often degrade the perceptual quality of the generated outputs. To tackle this problem, we introduce a DNN-based generative refiner that improves the perceptual quality of speech already processed by an SE method. As the refiner, we train a diffusion-based generative model using a dataset consisting only of clean speech. The model then replaces the degraded and distorted parts caused by a preceding SE method with newly generated clean parts via denoising diffusion restoration. Once our refiner is trained on a set of clean speech, it can be applied to various SE methods without additional training specialized for each SE module. Our refiner can therefore serve as a versatile post-processing module with respect to SE methods and has high potential in terms of modularity. Experimental results show that our method improved perceptual speech quality regardless of the preceding SE method used.
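The refinement idea described above can be sketched as follows: partially noise the SE output up to an intermediate diffusion step, then run the reverse (denoising) chain back to step zero so that distorted regions are regenerated from the clean-speech prior. This is a minimal DDPM-style sketch, not the paper's implementation; the noise schedule, the intermediate step `t_start`, and the `denoise_fn` noise predictor (standing in for the trained diffusion model) are all illustrative assumptions.

```python
import numpy as np

def make_schedule(T=50, beta_min=1e-4, beta_max=0.05):
    # Linear noise schedule (assumed; the actual schedule is a design choice).
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def refine(x_se, denoise_fn, t_start=25, T=50, seed=0):
    """Refine an SE output x_se: forward-noise it to step t_start,
    then run the reverse diffusion chain back to t = 0."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    # Forward process: mix the SE output with Gaussian noise at level t_start.
    x = (np.sqrt(alpha_bars[t_start]) * x_se
         + np.sqrt(1.0 - alpha_bars[t_start]) * rng.standard_normal(x_se.shape))
    # Reverse process: each step removes the predicted noise component.
    for t in range(t_start, -1, -1):
        eps_hat = denoise_fn(x, t)  # hypothetical trained noise predictor
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

Because the reverse chain starts from a partially noised version of the SE output rather than from pure noise, the refined signal stays close to the enhanced speech while the clean-speech prior fills in the degraded parts.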