Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model that achieves high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples with only 4 denoising steps. To further speed up inference, we present an active shallow diffusion mechanism, realized as a two-stage training scheme: a basic TTS acoustic model trained at stage one provides valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
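To make the few-step sampling concrete, the sketch below walks a noisy mel-spectrogram back through T = 4 denoising steps in the denoising-diffusion-GAN style the abstract describes: at each step a generator predicts a clean sample directly, and the next state is drawn from the Gaussian posterior q(x_{t-1} | x_t, x_0). This is a minimal illustration under stated assumptions, not the paper's implementation; the noise schedule, tensor shapes, and especially `generator_fn` (a dummy stand-in for DiffGAN-TTS's adversarially trained, text- and speaker-conditioned network) are all hypothetical.

```python
import numpy as np

T = 4                                   # number of denoising steps (per the abstract)
betas = np.linspace(0.1, 0.9, T)        # assumed coarse noise schedule, not the paper's
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def generator_fn(x_t, z, t):
    """Hypothetical stand-in for the adversarially trained generator.

    In DiffGAN-TTS this network is conditioned on text and speaker
    features and predicts a clean sample x_0 from the noisy input x_t;
    here we return a dummy prediction so the loop runs end to end."""
    return 0.5 * x_t  # placeholder only

def q_posterior_sample(x0_pred, x_t, t, rng):
    """Sample x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x_0)."""
    if t == 0:
        return x0_pred                  # final step: emit the clean prediction
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0_pred + coef_xt * x_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((80, 100))    # e.g. an 80-bin mel-spectrogram of 100 frames
for t in reversed(range(T)):            # only 4 iterations, versus hundreds in a vanilla DDPM
    z = rng.standard_normal(x_t.shape)  # latent input to the generator
    x0_pred = generator_fn(x_t, z, t)   # predict a clean sample directly
    x_t = q_posterior_sample(x0_pred, x_t, t, rng)
```

The design point this illustrates: because the adversarially trained generator can model a multimodal denoising distribution, each reverse step can span a much larger portion of the noise schedule than a Gaussian-assumption DDPM step, which is what allows sampling in 4 steps, and, with the shallow diffusion mechanism, in 1.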