In neural text-to-speech (TTS), two-stage systems, i.e., cascades of separately trained models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two models are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately trained models. Specifically, our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective within our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations from ESPNet2-TTS in subjective evaluation (MOS) and some objective evaluations.
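To make the pipeline contrast concrete, the following is a minimal PyTorch sketch, not the paper's implementation: `ToyAcousticModel`, `ToyVocoder`, and all dimensions are placeholder assumptions, and L1 losses stand in for the real adversarial and variance objectives. It illustrates why a cascade suffers an acoustic feature mismatch while joint training does not: the cascade's vocoder is fit to ground-truth mel-spectrograms but receives predicted ones at inference, whereas the joint model backpropagates through the vocoder applied to the acoustic model's own predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAcousticModel(nn.Module):
    """Placeholder for FastSpeech2: token IDs -> mel-spectrogram frames."""
    def __init__(self, vocab_size=64, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.proj = nn.Linear(128, mel_dim)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))  # (B, T, mel_dim)

class ToyVocoder(nn.Module):
    """Placeholder for HiFi-GAN: mel frames -> raw waveform samples."""
    def __init__(self, mel_dim=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(mel_dim, hop)

    def forward(self, mel):
        return self.upsample(mel).flatten(1)  # (B, T * hop)

acoustic, vocoder = ToyAcousticModel(), ToyVocoder()
tokens = torch.randint(0, 64, (2, 10))  # dummy text batch
target_mel = torch.randn(2, 10, 80)     # dummy ground-truth mels
target_wave = torch.randn(2, 10 * 256)  # dummy ground-truth waveform

# Cascade: each stage is fit to ground truth separately. At inference the
# vocoder receives *predicted* mels it never saw during training -- the
# feature mismatch that motivates fine-tuning on predicted features.
cascade_mel_loss = F.l1_loss(acoustic(tokens), target_mel)
cascade_voc_loss = F.l1_loss(vocoder(target_mel), target_wave)

# Joint (E2E): the vocoder consumes the acoustic model's own output, so
# gradients flow end to end and train/inference inputs match by design.
joint_loss = F.l1_loss(vocoder(acoustic(tokens)), target_wave) + cascade_mel_loss
joint_loss.backward()
```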
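The alignment learning objective can be sketched as a forward-sum over monotonic speech-text alignments. The snippet below is a hedged illustration of that general recipe, assuming a simple negative-distance score matrix between hypothetical text and mel encoder outputs and reusing PyTorch's CTC loss to perform the monotonic marginalization; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, mel_emb):
    """Forward-sum over monotonic speech-text alignments via CTC.

    text_emb: (B, N, D) token embeddings; mel_emb: (B, T, D) frame
    embeddings, with T >= N. Both encoders are assumed to exist upstream.
    """
    B, N, _ = text_emb.shape
    T = mel_emb.shape[1]
    # Frame-to-token log-posteriors from a negative-distance score matrix
    # (an illustrative scoring choice, not the paper's).
    log_probs = F.log_softmax(-torch.cdist(mel_emb, text_emb), dim=-1)  # (B, T, N)
    # CTC expects (T, B, C) with a blank class at index 0; make the blank
    # ~impossible so every frame must emit a token, shifting classes to 1..N.
    log_probs = F.pad(log_probs, (1, 0), value=-1e4).transpose(0, 1)    # (T, B, N+1)
    targets = torch.arange(1, N + 1).expand(B, N)  # each token once, in order
    return F.ctc_loss(
        log_probs, targets,
        input_lengths=torch.full((B,), T, dtype=torch.long),
        target_lengths=torch.full((B,), N, dtype=torch.long),
    )

# Usage with random embeddings: 6 tokens aligned against 40 mel frames.
print(alignment_loss(torch.randn(2, 6, 16), torch.randn(2, 40, 16)))
```

Because the target token classes are all distinct and appear exactly once in order, the CTC recursion here sums exactly over monotonic, non-skipping alignments, which is what lets the model learn durations without an external aligner.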