The diffusion model, a new generative model that is very popular in image generation and audio synthesis, is rarely used in speech enhancement. In this paper, we use the diffusion model as a module for stochastic refinement. We propose SRTNet, a novel method for speech enhancement via Stochastic Refinement operating entirely in the Time domain. Specifically, we design a joint network consisting of a deterministic module and a stochastic module, which together form the ``enhance-and-refine'' paradigm. We theoretically demonstrate the feasibility of our method and experimentally show that it achieves faster training, faster sampling, and higher quality. Our code and enhanced samples are available at https://github.com/zhibinQiu/SRTNet.git.
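The ``enhance-and-refine'' paradigm described above can be sketched in a minimal toy form: a deterministic module produces a first time-domain estimate, and a stochastic module then samples a residual refinement, diffusion-style. This is a hypothetical illustration, not the authors' implementation; the function names, the smoothing "denoiser", and the annealed-noise loop are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_enhance(noisy):
    # Stand-in for a learned time-domain denoiser (assumption):
    # a simple 5-tap moving average suppresses additive noise.
    kernel = np.ones(5) / 5.0
    return np.convolve(noisy, kernel, mode="same")

def stochastic_refine(estimate, steps=4):
    # Stand-in for the stochastic refinement module (assumption):
    # start from Gaussian noise and anneal it over a few reverse
    # steps; a trained score/noise network would guide this toward
    # the residual (clean - estimate) in the real method.
    x = rng.standard_normal(len(estimate))
    for t in range(steps, 0, -1):
        x = 0.5 * x + 0.1 * (t / steps) * rng.standard_normal(len(x))
    return estimate + x  # refined output = coarse estimate + residual sample

# Toy signal: a clean sine corrupted by additive Gaussian noise.
clean = np.sin(np.linspace(0, 8 * np.pi, 256))
noisy = clean + 0.3 * rng.standard_normal(256)

coarse = deterministic_enhance(noisy)      # "enhance" stage
refined = stochastic_refine(coarse)        # "refine" stage

err_noisy = np.mean((noisy - clean) ** 2)
err_coarse = np.mean((coarse - clean) ** 2)
print(round(err_noisy, 4), round(err_coarse, 4))
```

On this toy signal the deterministic stage alone already lowers the mean-squared error; in SRTNet the stochastic stage is what restores the fine detail that a purely deterministic mapping tends to over-smooth.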