PortaSpeech: 便携和高品质创制文本到语音 (PortaSpeech: Portable and High-Quality Generative Text-to-Speech)

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture. 2) To further compress the model size and memory footprint, we introduce the grouped parameter sharing mechanism to the affine coupling layers in the post-net. 3) To improve the expressiveness of synthesized speech and reduce the dependency on accurate fine-grained alignment between text and speech, we propose a linguistic encoder with mixture alignment combining hard inter-word alignment and soft intra-word alignment, which explicitly extracts word-level semantic information. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design in PortaSpeech is effective.

翻译：快速语音2 和 Glow-TTS 等非偏向性文本到语音(NAR-TTS) 模型。快速语音2 和 Glow-TTS 等非偏向性文本比方( NAR- TTS ) 模型可以同时合成来自给定文本的高质量语音。在分析了两种基因化NAR- TTS 模型( VAE 和正常流) 之后, 我们发现: VAE 能够捕捉远程语义学特征( 例如手动), 即使模型大小较小, 但也受到模糊和反常的结果; 正常流流流流流流有助于重建频率双双双双双双双双双双双双双双双双对齐键。我们采用一个较轻的语音比对比对比方( PortaSpeople-T) 模型, 并在前一个更强的基于动态后对内脏的后对内端输入, 更精确的内脏对内脏的内行进行精确度分析, 进一步将内脏内脏对内化的内脏内质分析, 演示后, 演示后, 演示内装的内装的内装的内装的内装的内装的内装的内装的内装改进后, 向内装的内装的内装的内装改进内装的内装的内装的内装的内装的内装的内装的内装的内装的内装的内装的内装式设计, 向内装式的内装的内装的内装的内装式的内装式变。

相关内容

MoDELS

关注 44

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

【PAISS 2021 教程】概率散度与生成式模型，92页ppt

专知会员服务

34+阅读 · 2021年11月30日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日