Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Because speech emotion has a hierarchical structure, it is challenging to disentangle emotional style across different speakers. Inspired by the recent success of speaker disentanglement with variational autoencoders (VAEs), we propose an any-to-any expressive voice conversion framework called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotional style information. We study the use of a style encoder to model emotional style explicitly. At run-time, StyleVC converts both speaker identity and emotional style for arbitrary speakers. Experiments validate the effectiveness of the proposed framework in both objective and subjective evaluations.
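The run-time behaviour described above can be illustrated with a toy sketch. It is not the StyleVC model: the `encode` function stands in for the four learned encoders with simple placeholders, and the names `Factors`, `encode`, and `convert` are hypothetical. It also assumes that linguistic content and pitch are taken from the source utterance while speaker identity and emotional style are taken from a reference utterance; the abstract does not state how pitch is handled at conversion time.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # one acoustic feature frame


@dataclass
class Factors:
    """The four factors named in the abstract (representation is an assumption)."""
    content: List[Frame]  # frame-level linguistic content
    pitch: List[float]    # frame-level pitch contour
    speaker: List[float]  # utterance-level speaker identity embedding
    style: List[float]    # utterance-level emotional style embedding


def mean_pool(frames: List[Frame]) -> List[float]:
    """Average each feature dimension over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def encode(frames: List[Frame], f0: List[float]) -> Factors:
    """Placeholder for the learned encoders (assumption, not the real model).

    Content keeps the frames as they are; speaker and style are taken as the
    two halves of the time-averaged frame, only to give them distinct values.
    """
    pooled = mean_pool(frames)
    half = len(pooled) // 2
    return Factors(
        content=[list(f) for f in frames],
        pitch=list(f0),
        speaker=pooled[:half],
        style=pooled[half:],
    )


def convert(source: Factors, reference: Factors) -> Factors:
    """Recombine factors: what is said comes from the source,
    who says it and with which emotion comes from the reference."""
    return Factors(
        content=source.content,
        pitch=source.pitch,          # assumption: pitch follows the source
        speaker=reference.speaker,
        style=reference.style,
    )


# Toy utterances: two frames of four features each, plus a pitch contour.
src = encode([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]], [120.0, 125.0])
ref = encode([[0.0, 0.0, 8.0, 8.0], [2.0, 2.0, 10.0, 10.0]], [210.0, 220.0])
converted = convert(src, ref)
```

In a real system the recombined factors would be passed to a decoder and a vocoder to synthesize the converted speech; the sketch stops at the recombination step because it is the only part the abstract describes.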