Recent studies have demonstrated the feasibility of single-stage neural text-to-speech, which generates raw waveforms directly from text without producing intermediate mel-spectrograms. Single-stage text-to-speech often faces two problems: (a) a one-to-many mapping problem, since the same text admits many possible speech variations, and (b) insufficient high-frequency reconstruction, since no supervision from ground-truth acoustic features is available during training. To address problem (a) and generate more expressive speech, we propose a novel phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows, which captures the underlying prosodic information in speech; a prosody predictor further supports end-to-end expressive speech synthesis. To address problem (b), we propose a dual parallel autoencoder that introduces supervision from ground-truth acoustic features during training, enabling our model to generate high-quality speech. We compare synthesis quality with state-of-the-art text-to-speech systems on an internal expressive English dataset. Both qualitative and quantitative evaluations demonstrate the superiority and robustness of our method for lossless speech generation, as well as its strong capability in prosody modeling.
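As a rough illustration of the phoneme-level prosody modeling described above, the following minimal PyTorch-style sketch posits a diagonal-Gaussian posterior per phoneme and refines it with affine-coupling normalizing flows. All module names (PhonemeProsodyVAE, AffineCoupling) and dimensions are illustrative assumptions, not the authors' implementation; the prosody predictor and dual parallel autoencoder are omitted.

    # Hypothetical sketch of phoneme-level prosody modeling with a
    # VAE posterior refined by normalizing flows. Illustrative only.
    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # One flow step: transform half the latent conditioned on the
        # other half; the log-determinant is the sum of log-scales.
        def __init__(self, dim: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim // 2, hidden), nn.ReLU(),
                nn.Linear(hidden, dim),  # per-dim shift and log-scale
            )

        def forward(self, z):
            za, zb = z.chunk(2, dim=-1)
            shift, log_scale = self.net(za).chunk(2, dim=-1)
            log_scale = torch.tanh(log_scale)  # keep scales bounded
            zb = zb * torch.exp(log_scale) + shift
            return torch.cat([za, zb], dim=-1), log_scale.sum(dim=-1)

    class PhonemeProsodyVAE(nn.Module):
        # Encodes the mel frames of one phoneme into a prosody latent;
        # flows make the posterior more expressive than a plain Gaussian.
        def __init__(self, in_dim: int = 80, z_dim: int = 16, n_flows: int = 4):
            super().__init__()
            self.enc = nn.GRU(in_dim, 32, batch_first=True)
            self.to_stats = nn.Linear(32, 2 * z_dim)  # mean and log-variance
            self.flows = nn.ModuleList(
                [AffineCoupling(z_dim) for _ in range(n_flows)])

        def forward(self, phoneme_frames):
            # phoneme_frames: (batch, frames, in_dim) mel frames per phoneme
            _, h = self.enc(phoneme_frames)
            mu, logvar = self.to_stats(h[-1]).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            logdet = torch.zeros(z.size(0), device=z.device)
            for flow in self.flows:
                z, ld = flow(z)
                logdet = logdet + ld
            return z, mu, logvar, logdet  # logdet enters the ELBO's KL term

    if __name__ == "__main__":
        model = PhonemeProsodyVAE()
        frames = torch.randn(8, 20, 80)  # 8 phonemes, 20 mel frames each
        z, mu, logvar, logdet = model(frames)
        print(z.shape, logdet.shape)     # (8, 16) and (8,)

At synthesis time, a prosody predictor would replace the acoustic encoder, predicting the latent z directly from text so that no reference speech is needed.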