Emotional voice conversion (EVC) aims to convert the emotional state of an utterance from one emotion to another while preserving the linguistic content and speaker identity. Current studies mostly focus on modelling the conversion between a few specific emotion types. Synthesizing mixed emotions could help us better imitate human emotions and facilitate more natural human-computer interaction. In this research, for the first time, we formulate and study the problem of mixed emotion synthesis for EVC. We regard emotional styles as a series of emotion attributes that are learnt with a ranking-based support vector machine (SVM). Each attribute measures the degree of relevance between speech recordings belonging to different emotion types. We then incorporate these attributes into a sequence-to-sequence (seq2seq) emotional voice conversion framework. During training, the framework learns not only to characterize the input emotional style but also to quantify its relevance to other emotion types. At run-time, various emotional mixtures can be produced by manually defining the attributes. We conduct objective and subjective evaluations to validate our idea in terms of mixed emotion synthesis. We further build an emotion triangle as an application of emotion transition. Code and speech samples are publicly available.
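The attribute idea described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes a linear ranking function trained with a pairwise hinge loss (a simplified stand-in for a ranking-based SVM), min-max normalization of ranking scores into [0, 1], and a plain list as the run-time attribute vector. The names `train_ranker`, `normalized_attribute`, and `make_attribute_vector`, and the toy two-dimensional features, are hypothetical and chosen only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_ranker(higher, lower, epochs=100, lr=0.1, reg=0.01, seed=0):
    """Learn w so that dot(w, h) exceeds dot(w, l) by a margin for every pair.

    Uses subgradient descent on reg/2 * ||w||^2 + max(0, 1 - w.(h - l)),
    an assumed simplification of a ranking SVM objective.
    """
    rng = random.Random(seed)
    w = [0.0] * len(higher[0])
    pairs = [(h, l) for h in higher for l in lower]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for h, l in pairs:
            diff = [a - b for a, b in zip(h, l)]
            violated = dot(w, diff) < 1.0  # margin not yet satisfied
            for i, d in enumerate(diff):
                grad = reg * w[i] - (d if violated else 0.0)
                w[i] -= lr * grad
    return w


def normalized_attribute(w, x, reference):
    """Map the ranking score of x into [0, 1] using scores of a reference set."""
    scores = [dot(w, r) for r in reference]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (dot(w, x) - lo) / (hi - lo)))


def make_attribute_vector(emotions, weights):
    """Build a manually defined attribute vector, one value per emotion type."""
    unknown = set(weights) - set(emotions)
    if unknown:
        raise ValueError(f"unknown emotion types: {sorted(unknown)}")
    vector = []
    for emotion in emotions:
        value = float(weights.get(emotion, 0.0))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"attribute for {emotion!r} must lie in [0, 1]")
        vector.append(value)
    return vector


if __name__ == "__main__":
    # Hypothetical acoustic features: emotional recordings rank above neutral.
    emotional = [[2.0, 0.0], [3.0, 1.0]]
    neutral = [[0.0, 0.0], [1.0, 1.0]]
    w = train_ranker(emotional, neutral)
    print(normalized_attribute(w, [2.0, 0.0], emotional + neutral))
    # Run-time control: a manually specified mixture of emotion types.
    print(make_attribute_vector(["angry", "happy", "sad"],
                                {"happy": 0.8, "sad": 0.4}))
```

In this sketch the attribute vector would be passed to the conversion model as a conditioning input. How the actual framework consumes it is not specified in the abstract and is left out here.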