Score-based generative models (SGMs) have recently shown impressive results on difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the complex short-time Fourier transform (STFT) domain and propose a novel training task for speech enhancement using a complex-valued deep neural network. We derive this training task within the formalism of stochastic differential equations (SDEs), which enables the use of predictor-corrector samplers. We also provide alternative formulations inspired by previous publications on generative diffusion models for speech enhancement. These formulations avoid the need for any prior assumptions on the noise distribution and make the training task purely generative, which, as we show, improves enhancement performance.
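The abstract mentions predictor-corrector (PC) samplers for the reverse-time SDE. The following is a minimal sketch of that sampling idea under loudly stated assumptions. It uses a variance-exploding (VE) SDE with a geometric noise schedule, which is an illustrative choice and not necessarily the paper's SDE. It also uses a toy "clean STFT coefficient" prior in which real and imaginary parts are i.i.d. Gaussian. Under that prior the exact score is known, so it stands in for the trained complex-valued network. The sketch is unconditional: it does not model conditioning on the noisy mixture. All names (`pc_sample`, `S0`, `SIGMA_MIN`, `SIGMA_MAX`) are hypothetical.

```python
import math
import random

# Hypothetical toy setup (NOT the paper's model or SDE): each "STFT coefficient" is
# complex with real and imaginary parts i.i.d. N(0, S0**2). Under a variance-exploding
# SDE the perturbed marginal stays Gaussian, so its exact score is known and replaces
# a trained complex-valued score network.
S0 = 1.0
SIGMA_MIN, SIGMA_MAX = 0.01, 10.0


def sigma(t: float) -> float:
    """Geometric noise schedule sigma(t) = sigma_min * (sigma_max / sigma_min) ** t."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def score(x: list[complex], t: float) -> list[complex]:
    """Exact score of the toy marginal, applied to real and imaginary parts alike."""
    var = S0**2 + sigma(t) ** 2  # per-component variance at time t
    return [-xi / var for xi in x]


def complex_noise(rng: random.Random, n: int) -> list[complex]:
    """Complex noise with independent standard-normal real and imaginary parts."""
    return [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


def norm(v: list[complex]) -> float:
    """Euclidean norm of a complex vector."""
    return math.sqrt(sum(abs(vi) ** 2 for vi in v))


def pc_sample(n: int = 2000, steps: int = 200, snr: float = 0.16, seed: int = 0) -> list[complex]:
    """Predictor-corrector sampling: reverse-diffusion predictor + one Langevin corrector."""
    rng = random.Random(seed)
    # Start from the toy marginal at t = 1 (dominated by the SIGMA_MAX noise).
    std_t1 = math.sqrt(S0**2 + SIGMA_MAX**2)
    x = [std_t1 * z for z in complex_noise(rng, n)]
    for i in range(steps):
        t, t_next = 1.0 - i / steps, 1.0 - (i + 1) / steps
        # Predictor: one reverse-diffusion step of the reverse-time VE SDE from t to t_next.
        d = sigma(t) ** 2 - sigma(t_next) ** 2
        s = score(x, t)
        z = complex_noise(rng, n)
        x = [xi + d * si + math.sqrt(d) * zi for xi, si, zi in zip(x, s, z)]
        # Corrector: one Langevin MCMC step at t_next; step size set from the target SNR.
        s = score(x, t_next)
        z = complex_noise(rng, n)
        eps = 2.0 * (snr * norm(z) / norm(s)) ** 2
        x = [xi + eps * si + math.sqrt(2.0 * eps) * zi for xi, si, zi in zip(x, s, z)]
    return x


samples = pc_sample()
per_component_var = sum(abs(xi) ** 2 for xi in samples) / (2 * len(samples))
print(f"per-component variance: {per_component_var:.3f} (toy target S0**2 = {S0**2})")
```

If the sampler works, the printed per-component variance should land near the toy prior's `S0**2 = 1`. Because the Langevin corrector uses a finite step size, it may come out slightly above 1. In a real enhancement setting, the analytic `score` would be replaced by a network evaluated on complex STFT coefficients and conditioned on the noisy input.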