Emotional voice conversion (VC) aims to convert a neutral voice into an emotional (e.g., happy) one while retaining the linguistic content and speaker identity. We note that decoupling emotional features from other speech information (such as speaker and content) is the key to achieving strong performance. Recent attempts at speech representation decoupling on neutral speech do not work well on emotional speech, owing to the more complex acoustic properties of the latter. To address this problem, we propose a novel Source-Filter-based Emotional VC model (SFEVC) that properly filters speaker-independent emotion features from both the timbre and pitch features. Our SFEVC model consists of multi-channel encoders, emotion separation encoders, and one decoder, where every encoder module adopts a designed information-bottleneck autoencoder. Additionally, to further improve the conversion quality across various emotions, we propose a novel two-stage training strategy based on the 2D Valence-Arousal (VA) space. Experimental results show that SFEVC combined with the two-stage training strategy outperforms all baselines and achieves state-of-the-art performance in speaker-independent emotional VC with nonparallel data.
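The encoders described above are built around an information bottleneck: a deliberately narrow latent dimension that forces the encoder to keep only the information needed for reconstruction, discarding nuisance factors. A minimal NumPy sketch of this idea follows; all dimensions, layer shapes, and the random weights are illustrative assumptions, not the paper's actual architecture or trained parameters.

```python
import numpy as np

def relu(x):
    # Simple nonlinearity between the projection layers.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 80-dim mel-spectrogram frames squeezed through an 8-dim bottleneck.
feat_dim, bottleneck_dim = 80, 8

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.standard_normal((feat_dim, bottleneck_dim)) * 0.1
W_dec = rng.standard_normal((bottleneck_dim, feat_dim)) * 0.1

def encode(frames):
    """Project speech frames down to the narrow bottleneck codes."""
    return relu(frames @ W_enc)

def decode(latents):
    """Reconstruct frames from the bottleneck codes."""
    return latents @ W_dec

x = rng.standard_normal((100, feat_dim))  # 100 speech frames
z = encode(x)                             # bottleneck codes, shape (100, 8)
x_hat = decode(z)                         # reconstruction, shape (100, 80)
```

Because the 8-dim code cannot carry all 80 dimensions of the input, training such a pair to minimize reconstruction error pressures the encoder to retain only the dominant factors of variation; SFEVC's encoders exploit the same squeeze to separate emotion from speaker and content information.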