Recently, diffusion-based generative models have been introduced to the task of speech enhancement. The corruption of clean speech is modeled as a fixed forward process in which increasing amounts of noise are gradually added. By learning to reverse this process iteratively, conditioned on the noisy input, clean speech is generated. We build upon our previous work and derive the training task within the formalism of stochastic differential equations. We present a detailed theoretical review of the underlying score-matching objective and explore different sampler configurations for solving the reverse process at test time. By using a sophisticated network architecture from the natural image generation literature, we significantly improve performance compared to our previous publication. We also show that we can compete with recent discriminative models and achieve better generalization when evaluating on a corpus different from the one used for training. We complement the evaluation results with a subjective listening test, in which our proposed method is rated best. Furthermore, we show that the proposed method achieves state-of-the-art performance in single-channel speech dereverberation. Our code and audio examples are available online at https://uhh.de/inf-sp-sgmse
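As background for the SDE formalism referenced above, a minimal sketch of the general score-based forward and reverse processes (following the standard formulation by Song et al.; the specific drift and diffusion coefficients used in this work are defined in the method section) is:

\begin{align}
  \mathrm{d}x_t &= f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \\
  \mathrm{d}x_t &= \left[f(x_t, t) - g(t)^2\,\nabla_{x_t} \log p_t(x_t)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},
\end{align}

where $w$ and $\bar{w}$ denote forward- and reverse-time Wiener processes, and the score $\nabla_{x_t} \log p_t(x_t)$ is approximated by a neural network conditioned on the noisy input.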