MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., by transferring the emotion from reference audio, predicting the emotion from input text, or controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and the soundness of the overall design.
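To make the three levels concrete, the following is a minimal sketch of how global, utterance-level, and syllable-level emotion signals could be combined with per-syllable text encoder outputs. It is not the MsEmoTTS implementation: the abstract does not say how the modules are fused, so the additive combination, the scaling of the global embedding by syllable strength, and every name here (`EMOTIONS`, `DIM`, `make_table`, `condition_encoder`) are assumptions made for illustration only.

```python
import random

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set
DIM = 4  # hypothetical embedding size


def make_table(num_rows, dim, seed):
    """Build a small random embedding table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def condition_encoder(encoder_out, emotion, utt_vec, strengths, table):
    """Add three levels of emotion conditioning to per-syllable encoder outputs.

    encoder_out: list of per-syllable vectors (length T, each of size DIM)
    emotion:     global emotion category (global level)
    utt_vec:     one vector shared by the whole utterance (utterance level)
    strengths:   one scalar per syllable, e.g. in [0, 1] (local level)
    """
    if len(strengths) != len(encoder_out):
        raise ValueError("need one strength value per syllable")
    global_vec = table[EMOTIONS.index(emotion)]
    conditioned = []
    for frame, s in zip(encoder_out, strengths):
        # Assumed fusion: the global embedding is scaled by the syllable's
        # strength, and the utterance vector is added unchanged.
        conditioned.append(
            [h + s * g + u for h, g, u in zip(frame, global_vec, utt_vec)]
        )
    return conditioned
```

In this sketch, the three synthesis modes described in the abstract differ only in where `utt_vec` and `strengths` come from: extracted from reference audio (transfer), predicted from the input text (prediction), or set by hand (control).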

Speech synthesis

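Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.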
