Traditional voice conversion (VC) has focused on speaker identity conversion for speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotional style of speech can be speaker-dependent. In this paper, we study the technique of jointly converting the speaker identity and the speaker-dependent emotional style, which we call expressive voice conversion. We propose a StarGAN-based framework that learns a many-to-many mapping across different speakers and takes speaker-dependent emotional style into account without the need for parallel data. To achieve this, we condition the generator on an emotional style encoding derived from a pre-trained speech emotion recognition (SER) model. Experiments validate the effectiveness of our proposed framework in both objective and subjective evaluations. To the best of our knowledge, this is the first study on expressive voice conversion.
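To illustrate the conditioning described above, the sketch below shows one generic way a generator can be conditioned on both a target-speaker code and an SER-derived emotion-style embedding. This is a minimal, hypothetical example assuming mel-spectrogram inputs, a one-hot speaker code, and a fixed-size emotion embedding; the module names, dimensions, and architecture are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (illustrative only): conditioning a StarGAN-style generator on a
# target-speaker code and an emotion-style embedding from a pre-trained SER model.
# All module names and dimensions here are assumptions, not the authors' design.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, n_mels=80, spk_dim=8, emo_dim=128, hidden=256):
        super().__init__()
        # Spectral frames, the speaker one-hot code, and the emotion embedding are
        # concatenated along the channel axis before being passed to the generator.
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + spk_dim + emo_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel, spk_code, emo_embedding):
        # mel: (B, n_mels, T); spk_code: (B, spk_dim); emo_embedding: (B, emo_dim)
        T = mel.size(-1)
        cond = torch.cat([spk_code, emo_embedding], dim=1).unsqueeze(-1).expand(-1, -1, T)
        return self.net(torch.cat([mel, cond], dim=1))

# Example forward pass with random tensors standing in for real features.
mel = torch.randn(2, 80, 120)                 # source mel-spectrograms
spk = torch.eye(8)[torch.tensor([0, 3])]      # target-speaker one-hot codes
emo = torch.randn(2, 128)                     # placeholder SER-derived style embeddings
out = ConditionalGenerator()(mel, spk, emo)
print(out.shape)                              # torch.Size([2, 80, 120])
```

In this kind of setup, the emotion embedding would typically be extracted from an intermediate layer of the frozen SER model so that the generator receives a continuous style representation rather than a hard emotion label.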