The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying its linguistic content. Most state-of-the-art approaches convert emotions only for seen speaker-emotion combinations. In this paper, we tackle the problem of converting the emotion of speakers for whom only neutral data are available at training and test time (i.e., unseen speaker-emotion combinations). To this end, we extend the recently proposed StarGANv2-VC architecture with dual encoders that learn speaker and emotion style embeddings separately, along with dual domain source classifiers. To achieve conversion to unseen speaker-emotion combinations, we propose a Virtual Domain Pairing (VDP) training strategy that virtually incorporates speaker-emotion pairs absent from the real data, without compromising the min-max game between the discriminator and generator in adversarial training. We evaluate the proposed method on a Hindi emotional speech database.
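As a toy illustration only (speaker and emotion names are hypothetical, and the actual VDP strategy involves pairing style embeddings during adversarial training), the core idea of virtual domain pairing can be sketched as enumerating the speaker-emotion combinations that never occur in the real training data, e.g., an unseen speaker with neutral-only recordings paired with a non-neutral emotion:

```python
import itertools

def virtual_domain_pairs(real_pairs, speakers, emotions):
    """Return the (speaker, emotion) domains absent from the real data.

    real_pairs: set of (speaker, emotion) combinations observed in training.
    The returned virtual pairs are the combinations a VDP-style strategy
    would synthesize, e.g., a neutral-only speaker paired with "angry".
    """
    all_pairs = set(itertools.product(speakers, emotions))
    return sorted(all_pairs - set(real_pairs))

# Toy setup (hypothetical labels): "spk3" has only neutral recordings.
real = {("spk1", "neutral"), ("spk1", "angry"),
        ("spk2", "neutral"), ("spk2", "happy"),
        ("spk3", "neutral")}
virtual = virtual_domain_pairs(real,
                               ["spk1", "spk2", "spk3"],
                               ["neutral", "angry", "happy"])
# virtual now contains the unseen combinations, including
# ("spk3", "angry") and ("spk3", "happy").
```

During training, mel-spectrograms for these virtual domains would be generated by combining the speaker embedding from one utterance with the emotion embedding from another, so the discriminator still only ever judges real-versus-generated samples.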