Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single-speaker and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicit labels of these categories.
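To make the style-conditioning idea concrete, the PyTorch sketch below shows one plausible way a fixed-size style vector, pooled from a reference utterance, can modulate decoder features of a parallel TTS model through adaptive instance normalization. The module names, dimensions, and pooling choice here are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a fixed-size style vector (illustrative)."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, style_dim, kernel_size=5, padding=2),
        )

    def forward(self, mel):            # mel: (batch, n_mels, frames)
        h = self.conv(mel)             # (batch, style_dim, frames)
        return h.mean(dim=-1)          # temporal average pooling -> (batch, style_dim)

class AdaIN1d(nn.Module):
    """Adaptive instance normalization: the style vector predicts the
    per-channel scale and shift applied to the decoder features."""
    def __init__(self, channels, style_dim=128):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, channels * 2)

    def forward(self, x, style):       # x: (batch, channels, frames)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(-1)) * self.norm(x) + beta.unsqueeze(-1)

# Usage sketch: condition decoder features on the style of a reference utterance.
enc, adain = StyleEncoder(), AdaIN1d(channels=256)
ref_mel = torch.randn(1, 80, 200)      # reference speech (mel-spectrogram)
feats = torch.randn(1, 256, 300)       # decoder hidden features for the target text
style = enc(ref_mel)                   # style vector extracted without style labels
conditioned = adain(feats, style)      # (1, 256, 300), modulated by the reference style

Because the style vector is learned from the reference audio itself rather than from categorical labels, the same mechanism can transfer prosody and emotional tone in a self-supervised manner, which is the property the abstract highlights.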