Neural networks can now generate high-quality, highly expressive speech for single sentences. Paragraph-level speech synthesis, however, remains challenging: the model must produce acoustic features that are coherent across sentences while still conveying each sentence's style. Moreover, training such models directly on overly long speech degrades synthesis quality. This paper proposes a high-quality, expressive paragraph speech synthesis system based on a multi-step variational autoencoder. Our approach employs multi-step latent variables to capture speech information and predicts them from text separately at different grammatical levels. We also propose a three-step training method to improve the decoupling process. The proposed TTS model was trained on the single-speaker French audiobook corpus released at the Blizzard Challenge 2023. Experimental results show that our system significantly outperforms the baseline models.
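To illustrate the idea of multi-step latent variables at different grammatical levels, here is a minimal sketch, assuming a hierarchy of paragraph, sentence, and word latents where each level is conditioned on the one above. The `encode` function, the level names, and the dimensions are all hypothetical stand-ins, not the paper's actual architecture; only the reparameterization trick is standard VAE machinery.

```python
import math
import random

random.seed(0)


def reparameterize(mu, log_var):
    # Standard VAE reparameterization: z = mu + sigma * eps.
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def encode(features, dim):
    # Hypothetical stand-in encoder mapping features to (mu, log_var);
    # a real system would use a neural network here.
    mu = [sum(features) / len(features)] * dim
    log_var = [0.0] * dim
    return mu, log_var


# Hierarchical latents at three hypothetical grammatical levels:
# paragraph -> sentence -> word, each conditioned on the level above,
# so sentence-level variation is sampled given the paragraph latent, etc.
paragraph_feats = [0.2, 0.4, 0.1]
z_para = reparameterize(*encode(paragraph_feats, 4))
z_sent = reparameterize(*encode(z_para, 4))
z_word = reparameterize(*encode(z_sent, 4))

print(len(z_para), len(z_sent), len(z_word))
```

The chain of conditioned samples is what lets a multi-step model decouple slow-varying paragraph context from per-sentence and per-word style, which is the separation the three-step training method is intended to promote.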