FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. FastDiff also samples 58x faster than real time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to the mel-spectrogram inversion of unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.
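To make the sampling-cost argument concrete, the following is a minimal sketch of a generic DDPM reverse (ancestral) sampling loop, not FastDiff's actual implementation. The linear noise schedule, the placeholder denoiser, and the step counts are all assumptions chosen for illustration; FastDiff instead uses a trained network and a learned noise schedule predictor, neither of which is reproduced here. The sketch only shows that the network is called once per sampling step, so reducing the number of steps reduces the cost proportionally.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an illustrative assumption, not FastDiff's predicted schedule)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def placeholder_denoiser(x_t, t, condition):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would predict the noise in x_t given the step t and a
    condition such as a mel-spectrogram; this stub always predicts zero noise.
    """
    return [0.0] * len(x_t)


def ddpm_sample(condition, num_samples, num_steps, denoiser=placeholder_denoiser, seed=0):
    """Run the standard DDPM reverse process and count denoiser calls."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]

    # Cumulative products alpha_bar_t = prod_{s <= t} alpha_s.
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    network_calls = 0

    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, condition)
        network_calls += 1  # one network evaluation per sampling step

        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps_hat)]

        if t > 0:
            sigma = math.sqrt(betas[t])  # common variance choice sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step

    return x, network_calls


# Illustrative step counts only: a long schedule versus a short one.
_, calls_long = ddpm_sample(None, num_samples=8, num_steps=1000)
_, calls_short = ddpm_sample(None, num_samples=8, num_steps=4)
print(calls_long, calls_short)
```

With a real network, each call in the loop is a full forward pass over the waveform, which is why a shorter schedule translates directly into faster synthesis.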

Related content

Speech synthesis

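Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Speech synthesis is now widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of media closely tied to daily life, such as smartphones, MP3 players, and PDAs, its applications are also extending into entertainment, spoken-language teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.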
