GANtron: Emotional Speech Synthesis with Generative Adversarial Networks

Speech synthesis is used in a wide variety of industries, yet synthesized speech often sounds flat or robotic. State-of-the-art methods that allow prosody control are cumbersome to use and hard to tune. To address some of these drawbacks, this work implements a text-to-speech model whose inferred speech can be tuned with the desired emotions. To do so, we use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model with an attention mechanism. We evaluate four configurations that differ in their inputs and training strategies, and show that our best model can generate speech files that lie in the same distribution as the initial training dataset. Additionally, we propose a new strategy that speeds up training convergence by applying a guided attention loss.
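The abstract does not spell out how the guided attention loss is computed. As a rough illustration only, the sketch below uses one widely used formulation: a soft diagonal penalty that is near zero when the attention between text position n and audio frame t follows the diagonal, and grows as it drifts away. The function names, the width parameter `g = 0.2`, and the plain-list representation of the attention matrix are assumptions made for this example, not details taken from the paper.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)).

    Assumed formulation: entries are 0 on the diagonal and approach 1
    far from it. `g` controls how wide the tolerated band is.
    """
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over all positions.

    `attention` is a list of rows, one row per text position and one
    column per output frame (a hypothetical layout for this sketch).
    """
    n_text = len(attention)
    n_frames = len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)


# A perfectly diagonal alignment is not penalized; a reversed one is.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]
```

In a training loop, a term like this would typically be added to the main synthesis loss so that the attention is nudged toward a monotonic alignment early on, which is one plausible reading of how such a loss could help convergence.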


Related content

Speech synthesis


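Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific application scenarios. Today, speech synthesis is widely used in information announcement systems in banks, hospitals, and similar settings, in car navigation systems, and in automated call centers, where it has delivered substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.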
