Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherently iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies under adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate features (e.g., Mel-spectrograms). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. Moreover, FastDiff enables a sampling speed 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to mel-spectrogram inversion of unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at \url{https://FastDiff.github.io/}.
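For intuition, below is a minimal, hypothetical PyTorch sketch of the building block named in the abstract, a time-aware location-variable convolution: a small kernel predictor maps the mel-spectrogram condition together with a diffusion-step embedding to per-frame convolution kernels, and each waveform segment is then filtered with the kernel predicted for its frame. This is a simplified depthwise variant for illustration only; the layer names, hyper-parameters, and per-channel kernel structure are assumptions and do not reproduce the paper's exact architecture.

\begin{verbatim}
# Hypothetical, simplified sketch -- not the official FastDiff code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeAwareLVC(nn.Module):
    """Depthwise location-variable convolution conditioned on the
    mel-spectrogram and a diffusion-step embedding (the "time-aware" part)."""

    def __init__(self, channels=32, cond_channels=80, step_emb_dim=128,
                 kernel_size=3, segment_len=32):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        self.segment_len = segment_len
        # Predicts one depthwise kernel (channels x kernel_size) per frame.
        self.kernel_predictor = nn.Conv1d(
            cond_channels + step_emb_dim, channels * kernel_size, 1)

    def forward(self, x, mel, step_emb):
        # x:        (B, C, T) noisy waveform features, T = frames * segment_len
        # mel:      (B, M, frames) mel-spectrogram condition
        # step_emb: (B, D) diffusion-step embedding
        B, C, T = x.shape
        frames = mel.size(-1)
        cond = torch.cat(
            [mel, step_emb.unsqueeze(-1).expand(-1, -1, frames)], dim=1)
        kernels = self.kernel_predictor(cond)               # (B, C*K, frames)
        kernels = kernels.permute(0, 2, 1).reshape(
            B, frames, self.channels, self.kernel_size)     # (B, frames, C, K)

        pad = self.kernel_size // 2
        x = F.pad(x, (pad, pad))
        # Overlapping segments, one per conditioning frame:
        # (B, C, frames, segment_len + 2*pad)
        segs = x.unfold(2, self.segment_len + 2 * pad, self.segment_len)
        out = torch.zeros(B, C, frames, self.segment_len, device=x.device)
        for k in range(self.kernel_size):
            # Apply the k-th tap of each frame's kernel to its own segment.
            out += kernels[..., k].permute(0, 2, 1).unsqueeze(-1) \
                   * segs[..., k:k + self.segment_len]
        return out.reshape(B, C, frames * self.segment_len)


# Toy usage: 80-band mel, 10 frames, 32 waveform samples per frame.
lvc = TimeAwareLVC()
y = lvc(torch.randn(2, 32, 320), torch.randn(2, 80, 10), torch.randn(2, 128))
print(y.shape)  # torch.Size([2, 32, 320])
\end{verbatim}

The design intuition this sketch tries to convey is that the convolution kernels are not fixed weights: they are re-predicted for every conditioning frame and every diffusion step, so the same layer adapts its filtering to both the local acoustic content and the current noise level.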