Multi-speaker text-to-speech (TTS) with only a few adaptation samples is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning and with only one adaptation utterance. Instead of using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of the latent variable Z is approximated by another variable conditioned on the reference mel-spectrogram and the phoneme sequence. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural and similar speech with only one adaptation utterance.
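To make the conditioning scheme concrete, the sketch below shows a minimal conditional VAE in PyTorch whose posterior is conditioned on a reference mel-spectrogram and a phoneme representation, mirroring the description above. All module names, dimensions, and the standard-normal prior in the KL term are illustrative assumptions for a generic CVAE, not the authors' nnSpeech implementation (whose prior is speaker-guided).

```python
# Minimal CVAE sketch (assumed shapes and module names; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, mel_dim=80, phoneme_dim=256, latent_dim=64):
        super().__init__()
        # Posterior q(Z | reference mel, phoneme): approximates the latent
        # carrying both speaker characteristics and content information.
        self.encoder = nn.Linear(mel_dim + phoneme_dim, 2 * latent_dim)
        # Decoder p(mel | Z, phoneme): reconstructs the target spectrogram frame.
        self.decoder = nn.Linear(latent_dim + phoneme_dim, mel_dim)

    def forward(self, ref_mel, phoneme):
        # Encode the conditioning pair into Gaussian posterior parameters.
        h = self.encoder(torch.cat([ref_mel, phoneme], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        # Reparameterization trick: sample Z differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, phoneme], dim=-1))
        return recon, mu, logvar

def cvae_loss(recon, target_mel, mu, logvar):
    # Standard CVAE objective: reconstruction term plus KL(q(Z|.) || N(0, I)).
    # nnSpeech would replace the N(0, I) prior with a speaker-guided one.
    rec = F.mse_loss(recon, target_mel)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

At inference time, such a model needs only one reference mel-spectrogram from the new speaker to sample Z, which is what enables zero-shot synthesis without fine-tuning.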