Advanced text-to-speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, and 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch, and energy from the speech waveform and directly take them as conditional inputs during training, and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate a speech waveform from text in parallel, enjoying the benefit of fully end-to-end training and even faster inference than FastSpeech. Experimental results show that 1) FastSpeech 2 and 2s outperform FastSpeech in voice quality with a much simplified training pipeline and reduced training time; and 2) FastSpeech 2 and 2s can match the voice quality of autoregressive models while enjoying much faster inference speed.
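The conditioning scheme described above (ground-truth pitch/energy as inputs during training, predicted values at inference) can be illustrated with a minimal sketch. This is not the paper's implementation: the module names, bin/embedding choices, and hyperparameters below are assumptions made only for illustration.

```python
# Minimal sketch (assumed names, not the authors' code) of the variance
# conditioning pattern: predictors are supervised with ground-truth values,
# ground truth conditions the model during training, and the predictors'
# own outputs take over at inference time.
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Predicts one scalar per position (e.g., pitch or energy) from hidden states."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, time, hidden)
        return self.net(h).squeeze(-1)                    # (batch, time)


class VarianceAdaptor(nn.Module):
    """Adds pitch/energy information to the hidden sequence.

    Ground-truth values extracted from the recorded speech condition the model
    during training; the predictors' outputs are used instead at inference.
    """

    def __init__(self, hidden_dim: int = 256, n_bins: int = 256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        # Quantize continuous (normalized) values into bins and embed them;
        # the bin range here is an arbitrary illustrative choice.
        self.register_buffer("bins", torch.linspace(-4.0, 4.0, n_bins - 1))
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)

    def _embed(self, values: torch.Tensor, table: nn.Embedding) -> torch.Tensor:
        return table(torch.bucketize(values, self.bins))

    def forward(self, h, gt_pitch=None, gt_energy=None):
        pred_pitch = self.pitch_predictor(h)
        pred_energy = self.energy_predictor(h)
        # Use ground truth when available (training); otherwise fall back
        # to the predictions (inference).
        pitch = gt_pitch if gt_pitch is not None else pred_pitch
        energy = gt_energy if gt_energy is not None else pred_energy
        h = h + self._embed(pitch, self.pitch_embed) + self._embed(energy, self.energy_embed)
        return h, pred_pitch, pred_energy


if __name__ == "__main__":
    adaptor = VarianceAdaptor()
    h = torch.randn(2, 37, 256)     # stand-in for encoder hidden states
    gt_pitch = torch.randn(2, 37)   # normalized ground-truth pitch contour
    gt_energy = torch.randn(2, 37)  # normalized ground-truth energy

    # Training-time call: condition on ground truth, supervise the predictors.
    out, p_hat, e_hat = adaptor(h, gt_pitch, gt_energy)
    loss = nn.functional.mse_loss(p_hat, gt_pitch) + nn.functional.mse_loss(e_hat, gt_energy)

    # Inference-time call: no ground truth, predictions condition the model.
    out_infer, _, _ = adaptor(h)
    print(out.shape, out_infer.shape, loss.item())
```

The same switch between ground-truth and predicted values would apply to duration, where the predicted durations additionally control how the hidden sequence is expanded to frame level.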