Using multiple reference audios and style embedding constraints for speech synthesis

An end-to-end speech synthesis model can take an utterance directly as reference audio and generate speech from text with prosody and speaker characteristics similar to that reference. However, an appropriate acoustic embedding must be manually selected during inference. Because only matched text and speech are used during training, inference with unmatched text and speech causes the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than only the target audio. The multiple reference audios are selected automatically using sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use a "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embeddings. Experimental results show that the proposed model improves speech naturalness and content quality with multiple reference audios and also outperforms the baseline model in ABX preference tests of style similarity.
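The automatic reference selection described in the abstract can be illustrated with a small sketch. The abstract says only that references are chosen by BERT-based sentence similarity; the use of cosine similarity, the top-k cutoff, and the names `cosine_similarity` and `select_references` are assumptions made for illustration, not the authors' implementation. The sentence embeddings are taken as given (in practice they might come from a BERT encoder), and the toy vectors below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_references(target_emb, candidates, k=3):
    """Return the ids of the k candidate utterances whose sentence
    embeddings are most similar to the target sentence embedding.

    `candidates` maps an utterance id to the embedding of its transcript.
    Assumption: similarity is cosine similarity; the paper may differ.
    """
    scored = [
        (cosine_similarity(target_emb, emb), utt_id)
        for utt_id, emb in candidates.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [utt_id for _, utt_id in scored[:k]]


# Toy 3-dimensional "sentence embeddings" (hypothetical values).
target = [1.0, 0.0, 0.0]
corpus = {
    "utt_a": [0.9, 0.1, 0.0],
    "utt_b": [0.0, 1.0, 0.0],
    "utt_c": [0.7, 0.7, 0.0],
    "utt_d": [-1.0, 0.0, 0.0],
}

print(select_references(target, corpus, k=2))  # ['utt_a', 'utt_c']
```

The audio paired with each selected id would then serve as one of the multiple reference audios; how the model combines them is not specified in the abstract and is left out here.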


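Related content

Speech synthesis

Speech synthesis, also called text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, with substantial economic benefit. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.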
