Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions at various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from the linguistic content and to encode the speaker style as a style embedding in a continuous space that forms the prototype of the emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate an emotion classification loss and an emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
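The two auxiliary losses mentioned above could be combined as sketched below. This is a minimal illustration, not the paper's implementation: the function name, the cross-entropy form of the classification loss, the cosine-distance form of the similarity loss, and the weights `w_cls` and `w_sim` are all assumptions for exposition.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def emotion_losses(logits, target, pred_emb, ref_emb, w_cls=1.0, w_sim=1.0):
    """Hypothetical combination of the two auxiliary objectives:
    an emotion classification loss on the converted speech and an
    embedding similarity loss pulling the predicted emotion embedding
    toward a reference emotion embedding."""
    # Classification term: cross-entropy against the target emotion label.
    probs = softmax(logits)
    cls_loss = -np.log(probs[np.arange(len(target)), target]).mean()
    # Similarity term: one common choice is 1 - cosine similarity.
    cos = (pred_emb * ref_emb).sum(axis=-1) / (
        np.linalg.norm(pred_emb, axis=-1) * np.linalg.norm(ref_emb, axis=-1)
    )
    sim_loss = (1.0 - cos).mean()
    return w_cls * cls_loss + w_sim * sim_loss
```

When the predicted embedding matches the reference, the similarity term vanishes and only the classification term remains; misaligned embeddings increase the total loss, encouraging the converted speech to carry the intended emotion.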