FastSpeech: Fast, Robust and Controllable Text to Speech

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction. The predicted durations are used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, which enables parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates skipped and repeated words, and can adjust voice speed smoothly. Most importantly, compared with autoregressive models, our model speeds up mel-spectrogram generation by 270x. Therefore, we call our model FastSpeech. We will release the code on GitHub.
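The length-regulator idea in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the argmax-counting rule for turning attention alignments into durations, and the `alpha` speed factor are all assumptions made for illustration, and plain Python lists stand in for the hidden-state tensors a real model would use.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def durations_from_attention(
    attention: Sequence[Sequence[float]], num_phonemes: int
) -> List[int]:
    """Derive per-phoneme durations from a teacher attention matrix.

    attention[t][i] is the weight that mel frame t places on phoneme i.
    Assumption: each frame is assigned to the phoneme it attends to most,
    and a phoneme's duration is the number of frames assigned to it.
    """
    durations = [0] * num_phonemes
    for frame_weights in attention:
        best = max(range(num_phonemes), key=lambda i: frame_weights[i])
        durations[best] += 1
    return durations


def length_regulate(
    phoneme_states: Sequence[T], durations: Sequence[int], alpha: float = 1.0
) -> List[T]:
    """Expand the phoneme sequence so its length matches the mel sequence.

    Each phoneme state is repeated round(duration * alpha) times.
    alpha is a hypothetical speed knob: values below 1.0 shorten the
    output (faster speech), values above 1.0 lengthen it (slower speech).
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("phoneme_states and durations must have equal length")
    expanded: List[T] = []
    for state, duration in zip(phoneme_states, durations):
        # Round half up; Python's built-in round() rounds halves to even.
        repeats = max(0, int(duration * alpha + 0.5))
        expanded.extend([state] * repeats)
    return expanded


if __name__ == "__main__":
    # Illustrative values only: 4 phonemes with durations 2, 2, 3, 1.
    states = ["h1", "h2", "h3", "h4"]
    print(length_regulate(states, [2, 2, 3, 1]))
    print(length_regulate(states, [2, 2, 3, 1], alpha=0.5))
```

Because every output frame is determined by the expanded sequence rather than by the previous frame, all frames can be produced in one parallel pass, which is the property the abstract credits for the speedup.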

Related content

MoDELS

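The ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems (MODELS) is the premier conference series for model-driven software and systems engineering, organized with the support of ACM SIGSOFT and IEEE TCSE. Since 1998, MODELS has covered all aspects of modeling, from languages and methods to tools and applications. Its attendees come from diverse backgrounds, including researchers, academics, engineers, and industrial professionals. MODELS 2019 is a forum where participants can exchange cutting-edge research results and innovative practical experience around modeling and model-driven software and systems. This year's edition will give the modeling community an opportunity to further advance the foundations of modeling and to present innovative applications of modeling in emerging areas such as cyber-physical systems, embedded systems, socio-technical systems, cloud computing, big data, machine learning, security, open source, and sustainability. Official website: http://www.modelsconference.org/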

Fundamentals of Model Optimization, Sayak Paul, 67-page slide deck · June 8, 2020

A Concise Tutorial on Recurrent Neural Networks (RNNs), 37-page slide deck · May 6, 2020

[ACL 2020, Amazon] Multiresolution and Multimodal Speech Recognition with Transformers · May 5, 2020

Causal Graphs, 52-page slide deck · April 19, 2020