Emotional voice conversion aims to transform the emotional prosody of speech while preserving the linguistic content and speaker identity. Prior studies show that emotional prosody can be disentangled with an encoder-decoder network conditioned on discrete representations, such as one-hot emotion labels; such networks learn to memorize a fixed set of emotional styles. In this paper, we propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during both training and run-time inference. In this way, the network is able to transfer both seen and unseen emotional styles to a new utterance. We show that the proposed framework consistently outperforms the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which covers multiple speakers and languages.
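To make the conditioning difference concrete, the sketch below contrasts a one-hot emotion label with a continuous style embedding produced by a pre-trained SER model, as the abstract describes. This is a minimal illustration only: the module names (`SEREncoder`, `Decoder`), layer choices, and dimensions are all hypothetical and not taken from the paper's actual VAW-GAN architecture.

```python
import torch
import torch.nn as nn

class SEREncoder(nn.Module):
    """Stand-in for a pre-trained SER model: maps an acoustic feature
    sequence to a continuous emotion-style embedding (hypothetical interface)."""
    def __init__(self, feat_dim=80, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, feats):            # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)           # final hidden state: (1, B, emb_dim)
        return h.squeeze(0)              # style embedding: (B, emb_dim)

class Decoder(nn.Module):
    """Decoder conditioned on a latent code plus the SER embedding,
    rather than on a fixed one-hot emotion label."""
    def __init__(self, latent_dim=128, emb_dim=64, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, z, style):         # z: (B, T, latent_dim)
        # Broadcast the utterance-level style embedding across all frames.
        style = style.unsqueeze(1).expand(-1, z.size(1), -1)
        return self.net(torch.cat([z, style], dim=-1))

# At run time, the SER embedding of any reference utterance -- whether its
# emotion was seen or unseen during training -- selects the target style,
# whereas a one-hot label could only index the trained emotion set.
ser, dec = SEREncoder(), Decoder()
ref = torch.randn(2, 100, 80)            # reference utterance features
z = torch.randn(2, 100, 128)             # latent content code
converted = dec(z, ser(ref))             # (2, 100, 80) converted features
```

Because the style code lives in a continuous embedding space rather than a discrete label set, an unseen emotion only needs a reference utterance at inference time, which is the property the abstract highlights.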