Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and synthesizing expressive speech has therefore attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules, namely a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows synthesizing emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from the input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer and text-based emotion prediction, respectively. In addition, the experiments show that the proposed method can control the emotion expression flexibly. Detailed analyses demonstrate the effectiveness of each module and the soundness of the overall design.
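To make the multi-scale conditioning concrete, the following is a minimal sketch, not the authors' implementation, of how a global emotion category (GM), an utterance-level style vector (UM), and local per-syllable strength scalars (LM) could be fused onto the text encoder outputs of an attention-based sequence-to-sequence model. All class names, dimensions, the mean-pooled reference encoder stand-in, and the fusion-by-addition scheme are assumptions for illustration; the actual GM/UM/LM designs in MsEmoTTS may differ.

```python
# Sketch of three-level emotion conditioning for a seq2seq TTS encoder.
# Hypothetical module: not from the MsEmoTTS codebase.
import torch
import torch.nn as nn

class MultiScaleEmotionConditioner(nn.Module):
    def __init__(self, num_emotions=5, d_model=256, n_mels=80):
        super().__init__()
        # GM: one embedding per global emotion category.
        self.global_emb = nn.Embedding(num_emotions, d_model)
        # UM: utterance-level variation; here a simple projection of a
        # mean-pooled reference mel, standing in for a reference encoder.
        self.utt_proj = nn.Linear(n_mels, d_model)
        # LM: a scalar emotion strength per token (e.g., per syllable),
        # projected up to the model dimension.
        self.local_proj = nn.Linear(1, d_model)

    def forward(self, text_enc, emotion_id, ref_mel, strengths):
        """
        text_enc:   (B, T, d_model) phoneme encoder outputs
        emotion_id: (B,)            global emotion category index
        ref_mel:    (B, L, n_mels)  reference mel for utterance-level style
        strengths:  (B, T)          per-token emotion strength in [0, 1]
        """
        g = self.global_emb(emotion_id).unsqueeze(1)         # (B, 1, d)
        u = self.utt_proj(ref_mel.mean(dim=1)).unsqueeze(1)  # (B, 1, d)
        l = self.local_proj(strengths.unsqueeze(-1))         # (B, T, d)
        # Broadcast-add the three scales onto the text encoding, which
        # then feeds the attention-based decoder.
        return text_enc + g + u + l

# Usage: strengths can come from a text-based predictor at inference
# time or be set manually, which is what enables manual strength control.
cond = MultiScaleEmotionConditioner()
out = cond(torch.randn(2, 50, 256), torch.tensor([0, 3]),
           torch.randn(2, 120, 80), torch.rand(2, 50))
print(out.shape)  # torch.Size([2, 50, 256])
```

Under this framing, emotion transfer corresponds to taking the UM input from reference audio, while text-based synthesis replaces the reference-derived inputs with predictions from the input text.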