Most neural text-to-speech (TTS) models require <speech, transcript> paired data from the desired speaker for high-quality speech synthesis, which limits the usage of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional denoising diffusion probabilistic model (DDPM) with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution of speech, our model can utilize untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given the transcript. We show that Guided-TTS achieves performance comparable to existing methods without any transcript for LJSpeech. Our results further show that a single speaker-independent phoneme classifier trained on large-scale multi-speaker data can guide unconditional DDPMs for various speakers to perform TTS.
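The guidance step described above is classifier guidance applied with a phoneme classifier: by Bayes' rule, the score of the conditional distribution decomposes as ∇ log p(x|y) = ∇ log p(x) + ∇ log p(y|x). Below is a minimal PyTorch sketch of that decomposition, assuming hypothetical `score_model(x_t, t)` and `classifier(x_t, t)` networks and a frame-aligned phoneme target `y`; the paper's actual sampler and gradient scaling are more involved.

```python
import torch

def guided_score(x_t, t, y, score_model, classifier, scale=1.0):
    """Classifier-guided score estimate (illustrative sketch, not the paper's code).

    x_t: noisy mel-spectrogram, shape (B, n_mels, T)
    t:   diffusion time step
    y:   frame-aligned phoneme targets, shape (B, T)
    """
    # Unconditional score estimate from the DDPM trained on untranscribed speech.
    s_uncond = score_model(x_t, t)

    # Gradient of the phoneme classifier's log-likelihood w.r.t. x_t.
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)                       # (B, T, n_phonemes)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Sum the log-probability of the target phoneme at each frame.
    log_p = log_probs.gather(-1, y.unsqueeze(-1)).sum()
    (grad,) = torch.autograd.grad(log_p, x_in)

    # Bayes' rule in score form: grad log p(x|y) = grad log p(x) + grad log p(y|x).
    return s_uncond + scale * grad
```

At each reverse-diffusion step, the sampler substitutes this guided score for the unconditional one, so the DDPM itself never needs transcripts during training; only the separately trained phoneme classifier carries the text conditioning.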