Most neural text-to-speech (TTS) models require <speech, transcript> paired data from the desired speaker for high-quality speech synthesis, which limits the usage of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional denoising diffusion probabilistic model (DDPM) with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution of speech, our model can utilize untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given the transcript. We show that Guided-TTS achieves performance comparable to existing methods without any transcript for LJSpeech. Our results further show that a single speaker-independent phoneme classifier trained on large-scale multi-speaker data can guide unconditional DDPMs for various speakers to perform TTS.
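The guidance step described above is classifier guidance applied with a phoneme classifier: by Bayes' rule, the score of the conditional distribution decomposes as ∇ log p(x|y) = ∇ log p(x) + ∇ log p(y|x). Below is a minimal PyTorch sketch of that decomposition, assuming hypothetical `score_model(x_t, t)` and `classifier(x_t, t)` networks and a frame-aligned phoneme target `y`; the paper's actual sampler and gradient scaling are more involved.

```python
import torch

def guided_score(x_t, t, y, score_model, classifier, scale=1.0):
    """Classifier-guided score estimate (illustrative sketch, not the paper's code).

    x_t: noisy mel-spectrogram, shape (B, n_mels, T)
    t:   diffusion time step
    y:   frame-aligned phoneme targets, shape (B, T)
    """
    # Unconditional score estimate from the DDPM trained on untranscribed speech.
    s_uncond = score_model(x_t, t)

    # Gradient of the phoneme classifier's log-likelihood w.r.t. x_t.
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)                       # (B, T, n_phonemes)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Sum the log-probability of the target phoneme at each frame.
    log_p = log_probs.gather(-1, y.unsqueeze(-1)).sum()
    (grad,) = torch.autograd.grad(log_p, x_in)

    # Bayes' rule in score form: grad log p(x|y) = grad log p(x) + grad log p(y|x).
    return s_uncond + scale * grad
```

At each reverse-diffusion step, the sampler substitutes this guided score for the unconditional one, so the DDPM itself never needs transcripts during training; only the separately trained phoneme classifier carries the text conditioning.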