Machine-generated speech is characterized by limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, the average variation learned from prosody sequences in the training data, or a style transferred from a source utterance. We propose a text-to-speech (TTS) system in which a user can choose the emotion of the generated speech from a continuous and meaningful emotion space (the Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker's style, with fine-grained control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a sample of their speech. Our work extends the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech.
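To make the conditioning scheme concrete, below is a minimal sketch (not the authors' released code) of how a continuous (arousal, valence) pair and a speaker embedding could condition a FastSpeech2-style encoder. All module names, dimensions, and the additive-conditioning choice are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AffectiveConditioner(nn.Module):
    """Hypothetical module: projects a 2-D (arousal, valence) point and a
    speaker embedding into the encoder's hidden space and adds both to the
    phoneme encodings, giving the decoder continuous affective control."""
    def __init__(self, hidden_dim: int = 256, speaker_dim: int = 192):
        super().__init__()
        # Maps the continuous 2-D emotion coordinate to the hidden size.
        self.emotion_proj = nn.Linear(2, hidden_dim)
        # Maps a speaker embedding (e.g. from a pretrained speaker
        # verification model) to the hidden size for multi-speaker synthesis.
        self.speaker_proj = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, encoder_out, arousal_valence, speaker_emb):
        # encoder_out:      (batch, phoneme_len, hidden_dim)
        # arousal_valence:  (batch, 2), continuous coordinates in A-V space
        # speaker_emb:      (batch, speaker_dim)
        emo = self.emotion_proj(arousal_valence).unsqueeze(1)  # (B, 1, H)
        spk = self.speaker_proj(speaker_emb).unsqueeze(1)      # (B, 1, H)
        # Broadcast-add the conditioning to every phoneme position.
        return encoder_out + emo + spk

# Usage: condition dummy encoder outputs on a chosen emotion and speaker.
cond = AffectiveConditioner()
enc = torch.randn(4, 37, 256)             # fake phoneme-level encodings
av = torch.tensor([[0.8, -0.3]] * 4)      # high arousal, mildly negative valence
spk = torch.randn(4, 192)                 # fake speaker embeddings
out = cond(enc, av, spk)                  # same shape as enc
```

Because the emotion input is a point in a continuous space rather than a categorical label, a scheme like this can, in principle, be queried at coordinates never seen during training, which is consistent with the generalization claim in the abstract.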