Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between speech samples of different emotions, and we incorporate it into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying their differences from other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
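To make the run-time control concrete, the following is a minimal sketch of how a manually defined emotion attribute vector might be built and used. The abstract does not specify the vector's layout or how the model consumes it, so everything here is an assumption for illustration: the label set `EMOTIONS`, the [0, 1] intensity range, the helper names `make_attribute_vector` and `mix_embeddings`, and the weighted-sum conditioning are all hypothetical and are not the paper's actual method.

```python
from typing import Dict, List

# Hypothetical label set; the abstract does not list the emotions used.
EMOTIONS: List[str] = ["happy", "sad", "angry", "surprise"]


def make_attribute_vector(intensities: Dict[str, float]) -> List[float]:
    """Build an attribute vector with one intensity per emotion.

    Assumption: each intensity lies in [0, 1]; unspecified emotions get 0.
    """
    unknown = set(intensities) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = [float(intensities.get(name, 0.0)) for name in EMOTIONS]
    if any(v < 0.0 or v > 1.0 for v in vector):
        raise ValueError("each intensity must lie in [0, 1]")
    return vector


def mix_embeddings(vector: List[float],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """Weighted sum of per-emotion embeddings.

    Assumption: the mixture conditions the synthesizer through a single
    embedding; this is one simple scheme, not the one in the paper.
    """
    dim = len(embeddings[EMOTIONS[0]])
    mixed = [0.0] * dim
    for weight, name in zip(vector, EMOTIONS):
        for i, value in enumerate(embeddings[name]):
            mixed[i] += weight * value
    return mixed


# Toy 2-dimensional embeddings, purely illustrative.
toy_embeddings = {
    "happy": [1.0, 0.0],
    "sad": [0.0, 1.0],
    "angry": [1.0, 1.0],
    "surprise": [0.0, 0.0],
}

attribute_vector = make_attribute_vector({"happy": 0.5, "sad": 0.25})
mixed_embedding = mix_embeddings(attribute_vector, toy_embeddings)
```

In this sketch, a request for a mostly happy utterance with a trace of sadness is expressed by setting two entries of the vector and leaving the rest at zero; in a real system the resulting embedding would be passed to the text-to-speech decoder alongside the text.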