Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving its linguistic content and speaker identity. In this paper, we propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed EVC framework leverages text-to-speech (TTS), since the two tasks share a common goal: generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style from linguistic content. In stage 2, we perform emotion training with a limited amount of emotional speech data to learn how to disentangle emotional style from linguistic information in speech. The proposed framework performs both spectrum and prosody conversion and achieves significant improvements over state-of-the-art baselines in both objective and subjective evaluations.