Diffusion models (DMs) are increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content from low-resolution speech utterances. This is commonly achieved by conditioning the noise-predictor network on the low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information in the low-resolution audio through the reverse sampling process of DMs. The proposed method can serve as a drop-in replacement for the vanilla sampling process and can significantly improve the performance of existing methods. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. With this formulation, we also attain state-of-the-art results on the VCTK Multi-Speaker benchmark.
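The abstract does not spell out how the low-resolution audio enters the reverse process, so the following is only a minimal sketch of one plausible mechanism, not the algorithm proposed in the paper. It assumes a frequency-domain replacement step: after each reverse update, the low band of the intermediate sample is overwritten with the low band of a noised copy of the input, while the generated high band is kept. The names `lowpass_fft`, `inject_low_band`, `y_up` (the low-resolution audio upsampled to the target rate), `alpha_t`, `sigma_t` (signal and noise scales at step t), and `keep_fraction` are all hypothetical.

```python
import numpy as np


def lowpass_fft(x, keep_fraction):
    """Ideal low-pass filter: zero every rFFT bin above the cutoff.

    keep_fraction is the share of rFFT bins retained (an assumption made
    for this sketch; a real system would likely use a proper filter design).
    """
    spec = np.fft.rfft(x)
    cutoff = int(len(spec) * keep_fraction)
    spec[cutoff:] = 0
    return np.fft.irfft(spec, n=len(x))


def inject_low_band(z_t, y_up, alpha_t, sigma_t, keep_fraction, rng):
    """Overwrite the low band of z_t with that of a noised copy of y_up.

    z_t      : intermediate sample produced by one reverse diffusion step
    y_up     : low-resolution audio upsampled to the target rate (assumed)
    alpha_t  : signal scale at step t (assumed notation)
    sigma_t  : noise scale at step t (assumed notation)
    """
    # Diffuse the reference to the same noise level as z_t.
    y_t = alpha_t * y_up + sigma_t * rng.standard_normal(len(y_up))
    # Keep the generated high band, replace the low band.
    high_band = z_t - lowpass_fft(z_t, keep_fraction)
    return lowpass_fft(y_t, keep_fraction) + high_band
```

In a sampler, a call such as `z_t = inject_low_band(z_t, y_up, alpha_t, sigma_t, keep_fraction, rng)` would follow each reverse update. Because the step touches only the intermediate sample and never the noise predictor, the same sketch applies to an unconditional DM, which is the property the abstract relies on when generalizing to different SR setups.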