Diffusion-based generative models have had a high impact on the computer vision and speech processing communities in recent years. Besides data generation tasks, they have also been employed for data restoration tasks such as speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g., for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation, and bandwidth extension. We observe that the generative approach performs globally better than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask