Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

Recent studies have demonstrated the feasibility of single-stage neural text-to-speech, which generates raw waveforms directly from text without first producing mel-spectrograms. Single-stage text-to-speech often faces two problems: (a) the one-to-many mapping problem, since the same text can be spoken with many different variations, and (b) insufficient high-frequency reconstruction, caused by the lack of supervision from ground-truth acoustic features during training. To address problem (a) and generate more expressive speech, we propose a novel phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows to model the underlying prosodic information in speech. We also use a prosody predictor to support end-to-end expressive speech synthesis. To address problem (b), we propose a dual parallel autoencoder that introduces supervision from the ground-truth acoustic features during training, enabling our model to generate high-quality speech. We compare synthesis quality against state-of-the-art text-to-speech systems on an internal expressive English dataset. Both qualitative and quantitative evaluations demonstrate the superiority and robustness of our method for lossless speech generation, while also showing a strong capability in prosody modeling.
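The abstract does not give implementation details, so the following is only a toy sketch of the generic building blocks behind "a variational autoencoder with normalizing flows": a reparameterized sample from a diagonal Gaussian posterior, one invertible element-wise affine transform with its log-determinant, and the KL term against a standard normal prior. The function names, the affine form of the flow, and the idea of one latent vector per phoneme are assumptions made for illustration, not the paper's architecture.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps for one (hypothetical) phoneme latent."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def affine_flow(z, log_scale, shift):
    """Element-wise affine flow (an assumed, simplest-possible flow step).

    Returns the transformed vector and log|det J|, which for an
    element-wise scale is just the sum of the log-scales.
    """
    out = [zi * math.exp(s) + b for zi, s, b in zip(z, log_scale, shift)]
    return out, sum(log_scale)

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [-1.0, -1.5]   # made-up posterior parameters
    z = sample_latent(mu, log_var, rng)
    z_flow, log_det = affine_flow(z, [0.1, -0.2], [0.0, 0.3])
    print(z_flow, log_det, kl_to_standard_normal(mu, log_var))
```

In a real system the posterior parameters would come from an encoder network and the flow would typically stack several learned invertible layers; this sketch only shows how the sampling, log-determinant, and KL pieces fit together.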

Related content

Speech synthesis

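Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation and statistical parametric synthesis, and then to hybrid synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Speech synthesis is now widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, with substantial economic benefit. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can fairly be said to be influencing every aspect of people's lives.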
