Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and speaker-dependent emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the speaker-dependent emotional style for expressive voice conversion. Motivated by the recent success of speaker disentanglement with variational autoencoders (VAEs), we propose an expressive voice conversion framework that can effectively disentangle linguistic content, speaker identity, pitch, and emotional style information. We study the use of an emotion encoder to model emotional style explicitly, and introduce mutual information (MI) losses to remove irrelevant information from the disentangled emotion representations. At run-time, our proposed framework can convert both speaker identity and speaker-dependent emotional style without the need for parallel data. Experimental results validate the effectiveness of our proposed framework in both objective and subjective evaluations.
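To make the quantity behind the MI losses concrete, the following is a minimal sketch (not the paper's implementation) of mutual information computed exactly between two discretized representation codes. The variable names are illustrative assumptions; in practice, MI between continuous embeddings is intractable and is approximated with a variational bound used as a training loss, but the underlying quantity being driven toward zero is the same.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Exact MI (in nats) between two discrete code sequences.

    MI(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ).
    MI = 0 means the two codes carry no shared information.
    """
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts for X
    py = Counter(ys)            # marginal counts for Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) * log( (c/n) / ((px/n)*(py/n)) ) = (c/n) * log(c*n / (px*py))
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Illustrative binary codes: a "leaky" emotion code that is fully
# predictable from the speaker code has high MI; a code that is
# statistically independent of the speaker code has zero MI.
speaker_code  = [0, 0, 1, 1, 0, 0, 1, 1]
emotion_leaky = [0, 0, 1, 1, 0, 0, 1, 1]  # identical -> MI = log(2)
emotion_clean = [0, 1, 0, 1, 0, 1, 0, 1]  # independent -> MI = 0

print(round(mutual_information(speaker_code, emotion_leaky), 3))
print(round(mutual_information(speaker_code, emotion_clean), 3))
```

An MI loss in the proposed framework plays the role of a differentiable surrogate for this quantity: minimizing it pushes the emotion representation toward the `emotion_clean` regime, where it shares no information with the speaker (or content) representations.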