Disentanglement of a speaker's timbre and style is crucial for style transfer in multi-speaker, multi-style text-to-speech (TTS) scenarios. With timbre and style disentangled, a TTS system can synthesize expressive speech for a given speaker in any style seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings: existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or rely on complex networks and complicated training procedures that are hard to reproduce and offer little control over style transfer behavior. To improve the disentanglement of timbre and style, and to remove the reliance on single-speaker multi-style corpora, this paper proposes a simple but effective disentanglement method. FastSpeech2 is employed as the backbone network, with explicit duration, pitch, and energy trajectories representing the style. Each speaker's data is treated as a separate, isolated style, and a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization is applied to further improve the decoupling effect. Experimental results demonstrate that the proposed model can synthesize speech in any style seen during training with high style similarity while maintaining very high speaker similarity.
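The abstract does not specify how the utterance-level pitch and energy normalization is computed. A minimal sketch, assuming per-utterance z-score normalization with pitch statistics taken over voiced frames only (unvoiced frames conventionally encoded as zero); the function name and `eps` parameter are illustrative, not from the paper:

```python
import numpy as np

def normalize_utterance(pitch, energy, eps=1e-8):
    """Per-utterance z-score normalization of pitch and energy contours.

    Assumption: unvoiced frames carry pitch == 0, so pitch statistics
    are computed over voiced frames only and unvoiced frames stay 0.
    """
    voiced = pitch > 0
    if voiced.any():
        mean, std = pitch[voiced].mean(), pitch[voiced].std()
        # Normalize voiced frames; keep unvoiced frames at 0.
        pitch = np.where(voiced, (pitch - mean) / (std + eps), 0.0)
    # Energy is normalized over all frames of the utterance.
    energy = (energy - energy.mean()) / (energy.std() + eps)
    return pitch, energy
```

Normalizing per utterance removes speaker-dependent absolute pitch and loudness levels, so the style embedding can focus on relative prosodic contours rather than memorizing a speaker's average F0.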