通过电话级别内容-调频解析在文本到语音合成合成中精美的风格建模、传输和预测 (Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement)

This paper presents a novel design of neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration https://daxintan-cuhk.github.io/pl-csd-speech .

翻译：本文展示了一种新颖的神经网络系统设计,用于在语音文本到语音合成中进行细微的风格建模、传输和预测。通过从电话语音部分的Mel-spectrographes嵌入功能,实现了精美的建模。合作学习和对抗性学习战略的应用是为了在语音中有效地解开内容和风格因素,缓解风格建模中的“内容渗漏”问题。提议的系统可用于单声频情景中不同内容的语音风格转移。客观和主观评估的结果显示,我们的系统比其他精美语音风格转移模式表现得更好,特别是在内容保护方面。通过采用风格预测器,拟议的系统也可以用于文本到语音合成。为系统演示 https://daxintan-cuhk.github.io/pl-cd-speech提供了音样样本。

相关内容

语音合成

关注 491

语音合成（Speech Synthesis），也称为文语转换（Text-to-Speech, TTS,它是将任意的输入文本转换成自然流畅的语音输出。语音合成涉及到人工智能、心理学、声学、语言学、数字信号处理、计算机科学等多个学科技术，是信息处理领域中的一项前沿技术。随着计算机技术的不断提高，语音合成技术从早期的共振峰合成,逐步发展为波形拼接合成和统计参数语音合成，再发展到混合语音合成；合成语音的质量、自然度已经得到明显提高，基本能满足一些特定场合的应用需求。目前，语音合成技术在银行、医院等的信息播报系统、汽车导航系统、自动应答呼叫中心等都有广泛应用，取得了巨大的经济效益。另外，随着智能手机、MP3、PDA 等与我们生活密切相关的媒介的大量涌现，语音合成的应用也在逐渐向娱乐、语音教学、康复治疗等领域深入。可以说语音合成正在影响着人们生活的方方面面。

【CIKM2020】推荐系统的神经模板解释生成

专知会员服务

34+阅读 · 2020年9月9日

【MIT】反偏差对比学习，Debiased Contrastive Learning

专知会员服务

91+阅读 · 2020年7月4日

贝叶斯网络在医疗的应用综述：过去，现在和未来 | A Comprehensive Scoping Review of Bayesian Networks in Healthcare: Past, Present and Future

专知会员服务

41+阅读 · 2020年2月26日

【CVPR2020】用于细粒度动作识别的多模式域自适应，Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

专知会员服务

78+阅读 · 2020年2月25日