Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting the clean data, avoiding the pronounced quality degradation that arises when accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as the training target and distills this behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff enables a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at \url{https://ProDiff.github.io/}.
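The following is a minimal PyTorch sketch (not the authors' released implementation) of the two ideas summarized above: (1) a denoiser that directly predicts the clean mel-spectrogram x0 rather than the noise/score, and (2) a progressive-distillation target in which an N-step DDIM teacher generates the mel-spectrogram used to supervise an N/2-step student. All module names (e.g. `DenoiserX0`), shapes, and the noise schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DenoiserX0(nn.Module):
    """Tiny stand-in for the spectrogram denoiser; predicts x0 from (x_t, t)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + 1, hidden), nn.ReLU(), nn.Linear(hidden, n_mels)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the normalized timestep via simple concatenation.
        t_feat = t.float().unsqueeze(-1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def q_sample(x0: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """Forward diffusion: x_t = sqrt(alpha_bar)*x0 + sqrt(1 - alpha_bar)*noise."""
    noise = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise


@torch.no_grad()
def ddim_sample(model: DenoiserX0, shape, alpha_bars: torch.Tensor) -> torch.Tensor:
    """Deterministic DDIM-style sampling with the x0-prediction parameterization."""
    x_t = torch.randn(shape)
    steps = len(alpha_bars)
    for i in reversed(range(steps)):
        t = torch.full(shape[:1], i / max(steps - 1, 1))
        x0_hat = model(x_t, t)
        if i == 0:
            return x0_hat
        # Re-noise the predicted x0 to the previous (less noisy) level.
        x_t = q_sample(x0_hat, alpha_bars[i - 1])
    return x_t


# Progressive distillation step: 4-step DDIM teacher -> 2-step student (illustrative).
teacher, student = DenoiserX0(), DenoiserX0()
student.load_state_dict(teacher.state_dict())  # initialize student from teacher

teacher_alpha_bars = torch.tensor([0.9, 0.6, 0.3, 0.05])  # toy noise schedule
student_alpha_bars = teacher_alpha_bars[1::2]              # keep every other level

batch = (8, 80)  # (batch, n_mels); a real model operates on full spectrogram frames
target_mel = ddim_sample(teacher, batch, teacher_alpha_bars)  # teacher output = target

# Train the student to reproduce the teacher's mel-spectrogram from a noisy input,
# reducing target-side variance so the student can converge with N/2 steps.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
t_idx = torch.randint(len(student_alpha_bars), (1,)).item()
x_t = q_sample(target_mel, student_alpha_bars[t_idx])
t = torch.full(batch[:1], t_idx / max(len(student_alpha_bars) - 1, 1))
loss = nn.functional.mse_loss(student(x_t, t), target_mel)
loss.backward()
optimizer.step()
```

In this sketch, the x0-prediction parameterization lets each sampling step output a complete mel-spectrogram estimate, which is what allows the distilled student to produce sharp predictions in as few as 2 iterations.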