We introduce the first unsupervised speech synthesis system, built on a simple yet effective recipe. The framework leverages recent work in unsupervised speech recognition together with existing neural speech synthesis. Using only unlabeled speech audio, unlabeled text, and a lexicon, our method enables speech synthesis without a human-labeled corpus. Experiments demonstrate that the unsupervised system can synthesize speech similar to that of a supervised counterpart in naturalness and intelligibility, as measured by human evaluation.