Controllable cross-speaker emotion transfer for end-to-end speech synthesis

Cross-speaker emotion transfer in text-to-speech (TTS) aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During transfer, identity information of the source speaker can also affect the synthesized result, an issue known as speaker leakage. This paper proposes a method that synthesizes controllable, emotionally expressive speech while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The method is a Tacotron2-based framework that uses an emotion embedding as the conditioning variable to provide emotion information. It contains two emotion disentangling modules, which 1) obtain a speaker-independent and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of the synthetic speech to match those expected. Moreover, we present an intuitive way to control the emotion strength of the synthetic speech for the target speaker: the learned emotion embedding is adjusted with a flexible scalar value, which controls the emotion strength conveyed by the embedding. Extensive experiments on a disjoint Mandarin corpus show that the proposed method synthesizes reasonable emotional speech for the target speaker. Compared with state-of-the-art reference-embedding learning methods, our method performs best on the cross-speaker emotion transfer task, indicating that it achieves new state-of-the-art performance in learning speaker-independent emotion embeddings. Furthermore, the strength ranking test and pitch trajectory plots demonstrate that the proposed method can effectively control emotion strength, leading to prosody-diverse synthetic speech.
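The strength control described in the abstract can be illustrated with a minimal sketch. The abstract only says the learned emotion embedding is "adjusted with a flexible scalar value"; the sketch below assumes that adjustment is an element-wise multiplication, and the function name `scale_emotion_embedding` and the toy three-dimensional embedding are hypothetical, not taken from the paper.

```python
from typing import List


def scale_emotion_embedding(embedding: List[float], strength: float) -> List[float]:
    """Return a copy of the emotion embedding with every dimension
    multiplied by a scalar strength.

    Assumption: the "adjustment" in the abstract is plain multiplication.
    """
    return [strength * value for value in embedding]


# Toy embedding standing in for a learned emotion embedding (hypothetical values).
emotion_embedding = [0.5, -1.0, 2.0]

# A strength below 1 would weaken the conveyed emotion, above 1 would strengthen it.
weak = scale_emotion_embedding(emotion_embedding, 0.5)
strong = scale_emotion_embedding(emotion_embedding, 2.0)

print(weak)    # [0.25, -0.5, 1.0]
print(strong)  # [1.0, -2.0, 4.0]
```

Under this assumption the scaled vector would replace the original embedding as the conditioning input at synthesis time, so a single trained model could produce several strength levels for the same emotion.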


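Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can basically meet the needs of some specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be affecting every aspect of people's lives.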
