可控语音合成的等级生成建模 (Hierarchical Generative Modeling for Controllable Speech Synthesis)

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and use it to synthesize clean speech with controllable speaking style.

翻译：本文提出了一个神经序列到顺序文本到语音( TTS) 模型, 该模型可以控制生成的语音中的隐性属性, 而这种潜在属性在培训数据中很少加注, 如语音风格、口音、背景噪音和记录条件等。该模型是根据变异自动读数( VAE) 框架设计成一个有条件的基因变异模型, 具有两级等级潜伏变量。第一层是一个绝对变量, 代表属性组( 如清洁/ noisy) 并提供可解释性。第二层, 以第一层为条件, 是多变量, 用于描述特定属性配置( 如噪声水平、语音速度), 并且能够对这些属性进行分解细微的精细控制。这相当于在潜在分布中使用高斯混合模型( GMM ) 来显示它控制上述属性的能力。特别是, 我们用真实发现的数据来培训高品质可控 TTS 模型, 它能够从响音调的语音风格中推断出语音特性, 用它来合成清晰的语音控制。

相关内容

语音合成

关注 491

语音合成（Speech Synthesis），也称为文语转换（Text-to-Speech, TTS,它是将任意的输入文本转换成自然流畅的语音输出。语音合成涉及到人工智能、心理学、声学、语言学、数字信号处理、计算机科学等多个学科技术，是信息处理领域中的一项前沿技术。随着计算机技术的不断提高，语音合成技术从早期的共振峰合成,逐步发展为波形拼接合成和统计参数语音合成，再发展到混合语音合成；合成语音的质量、自然度已经得到明显提高，基本能满足一些特定场合的应用需求。目前，语音合成技术在银行、医院等的信息播报系统、汽车导航系统、自动应答呼叫中心等都有广泛应用，取得了巨大的经济效益。另外，随着智能手机、MP3、PDA 等与我们生活密切相关的媒介的大量涌现，语音合成的应用也在逐渐向娱乐、语音教学、康复治疗等领域深入。可以说语音合成正在影响着人们生活的方方面面。

【MIT】条件说唱歌词生成与去噪自动编码器，Conditional Rap Lyrics Generation with Denoising Autoencoders

专知会员服务

16+阅读 · 2020年4月8日

用于大型遥感影像检索的深度学习，Deep Learning for Image Search and Retrieval in Large Remote Sensing Archives

专知会员服务

39+阅读 · 2020年4月6日

【CVPR2020-清华大学】渐进对抗网络的细粒度域适应，Progressive Adversarial Networks

专知会员服务

27+阅读 · 2020年4月4日

抢鲜看！13篇CVPR2020论文链接/开源代码/解读

专知会员服务

50+阅读 · 2020年2月26日