Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from a given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that the VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size but suffers from blurry and unnatural results, whereas the normalizing flow is good at reconstructing frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and the mel-spectrogram details accurately, we adopt as the main architecture a lightweight VAE with an enhanced prior, followed by a flow-based post-net with strong conditional inputs. 2) To further compress the model size and memory footprint, we introduce a grouped parameter sharing mechanism into the affine coupling layers of the post-net. 3) To improve the expressiveness of synthesized speech and reduce the dependency on accurate fine-grained alignment between text and speech, we propose a linguistic encoder with mixture alignment, which combines hard inter-word alignment with soft intra-word alignment and explicitly extracts word-level semantic information. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling on subjective and objective evaluation metrics, and shows only a slight performance degradation when the model is reduced to 6.7M parameters (roughly a 4x reduction in model size and a 3x reduction in runtime memory compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective.
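To make the grouped parameter sharing idea in point 2) more concrete, the following is a minimal sketch and not the PortaSpeech implementation. It assumes one plausible reading of the mechanism: the flow steps of the post-net are divided into groups, and all affine coupling layers within a group reuse the same conditioning network. The class `CouplingNet`, the helper names, the toy linear conditioning network, and the step and group sizes are all hypothetical choices made for illustration.

```python
import math
import random


class CouplingNet:
    """Toy stand-in (assumption) for the network that predicts log-scale and shift."""

    def __init__(self, half_dim, rng):
        self.half_dim = half_dim
        self.w_s = [[rng.uniform(-0.5, 0.5) for _ in range(half_dim)] for _ in range(half_dim)]
        self.w_t = [[rng.uniform(-0.5, 0.5) for _ in range(half_dim)] for _ in range(half_dim)]

    def num_params(self):
        return 2 * self.half_dim * self.half_dim

    def __call__(self, xa):
        # tanh keeps the log-scale bounded so the toy flow stays numerically stable.
        log_s = [math.tanh(sum(w * x for w, x in zip(row, xa))) for row in self.w_s]
        shift = [sum(w * x for w, x in zip(row, xa)) for row in self.w_t]
        return log_s, shift


def coupling_forward(net, x):
    """Affine coupling: keep the first half, scale and shift the second half."""
    h = len(x) // 2
    xa, xb = x[:h], x[h:]
    log_s, shift = net(xa)
    yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_s, shift)]
    return xa + yb


def coupling_inverse(net, y):
    """Exact inverse of coupling_forward, using the untouched first half."""
    h = len(y) // 2
    ya, yb = y[:h], y[h:]
    log_s, shift = net(ya)
    xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_s, shift)]
    return ya + xb


def build_flow(num_steps, group_size, half_dim, seed=0):
    """Return one net per flow step; steps in the same group share one object."""
    rng = random.Random(seed)
    num_groups = math.ceil(num_steps / group_size)
    nets = [CouplingNet(half_dim, rng) for _ in range(num_groups)]
    return [nets[i // group_size] for i in range(num_steps)]


def flow_forward(steps, x):
    for net in steps:
        x = coupling_forward(net, x)
        x = x[::-1]  # reverse the vector so the other half is transformed next
    return x


def flow_inverse(steps, y):
    for net in reversed(steps):
        y = y[::-1]  # undo the reversal before undoing the coupling
        y = coupling_inverse(net, y)
    return y


def count_unique_params(steps):
    """Count parameters once per distinct network object."""
    unique = {id(net): net for net in steps}
    return sum(net.num_params() for net in unique.values())


# 12 flow steps on 4-dimensional inputs (half_dim = 2).
shared = build_flow(num_steps=12, group_size=4, half_dim=2)
unshared = build_flow(num_steps=12, group_size=1, half_dim=2)

x = [0.1, -0.2, 0.3, 0.4]
x_rec = flow_inverse(shared, flow_forward(shared, x))
```

In this toy setting, sharing within groups of four steps reduces the number of distinct conditioning networks from 12 to 3 while the flow remains exactly invertible. The sketch says nothing about the quality trade-off, which is what the experiments in the paper evaluate.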