An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language, without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology for languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system consisting of an alignment module that outputs pseudo-text and a synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system achieves performance comparable to the supervised system in seven languages with about 10-20 hours of speech each. A careful study of the effect of text units and vocoders has also been conducted to better understand which factors may affect unsupervised TTS performance. Samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
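The two-stage pipeline described above can be sketched as follows. This is a minimal toy illustration, not the authors' actual implementation: every function here is a hypothetical placeholder standing in for the real alignment and synthesis models, with dummy outputs so the control flow is runnable.

```python
# Hypothetical sketch of the unsupervised TTS pipeline: an alignment module
# produces pseudo-text for untranscribed speech, a synthesis module trains on
# (pseudo-text, waveform) pairs, and real text is used only at inference time.
# All names and logic are illustrative placeholders, not the paper's code.

def align_speech_to_units(waveforms):
    """Alignment module: map untranscribed waveforms to pseudo-text.

    Stands in for an unsupervised speech-to-unit model; here it simply
    emits one dummy unit per four samples so the pipeline runs end to end.
    """
    return [[f"unit{i % 3}" for i in range(len(w) // 4)] for w in waveforms]

def train_synthesizer(pseudo_texts, waveforms):
    """Synthesis module: trained on (pseudo-text, waveform) pairs.

    A real system would fit a neural TTS model; this placeholder just
    records the pseudo-text vocabulary it was trained on.
    """
    return {"vocab": {u for seq in pseudo_texts for u in seq}}

def synthesize(model, text):
    """Inference: real text goes into the trained synthesizer."""
    return [0.0] * (len(text.split()) * 4)  # dummy waveform samples

# Toy corpora: untranscribed speech plus unpaired text, no parallel data.
speech = [[0.1] * 16, [0.2] * 8]
pseudo = align_speech_to_units(speech)   # pseudo-text labels for training
model = train_synthesizer(pseudo, speech)
wav = synthesize(model, "hello world")   # real text only at inference
```

The key property the sketch preserves is that the synthesizer never sees paired (real text, speech) data: its training labels come entirely from the alignment module.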