This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we use generative pre-trained transformer (GPT)-3 to jointly predict an emotion class and its strength, which represent the coarse and fine properties of the emotion, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model for generating output speech signals. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture the emotional context across consecutive sentences, the proposed method can effectively handle paragraph-level generation of emotional speech.
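As an illustration of how a predicted emotion class and strength might be combined into a single conditioning vector, the following minimal sketch interpolates between a neutral embedding and a class embedding. This is an assumption made for illustration only: the abstract does not specify the combination rule, and the embedding table, its dimensionality, the strength range [0, 1], and the function name `combine_emotion` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical 3-dimensional emotion embeddings (illustrative values only;
# a real system would learn these jointly with the TTS model).
EMOTION_EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}


def combine_emotion(emotion_class: str, strength: float) -> List[float]:
    """Return a conditioning vector for the given class and strength.

    Assumed rule (not taken from the paper): linear interpolation from the
    neutral embedding (strength 0) to the class embedding (strength 1).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    neutral = EMOTION_EMBEDDINGS["neutral"]
    target = EMOTION_EMBEDDINGS[emotion_class]
    return [n + strength * (t - n) for n, t in zip(neutral, target)]
```

Under these assumptions, the class selects a direction in the embedding space (the coarse property) and the strength sets how far to move along it (the fine property); the resulting vector would then be passed to the TTS model as a conditional feature.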