Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model that leverages large amounts of untranscribed speech data. With our pre-training, the amount of paired transcribed data required to train the model for the target downstream TTS task is remarkably reduced. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which encourages the model to learn the proper temporal alignment between input and output sequences. In addition, we propose a data augmentation method that further improves data efficiency during fine-tuning. We empirically demonstrate the effectiveness of the proposed method in low-resource language scenarios, where it outperforms competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping
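The pre-training objective above (reconstructing the original mel-spectrogram from a time-warped version) can be illustrated with a minimal warping routine. The sketch below uses segmentwise random time-resampling with NumPy; the segment count, resampling range, and linear interpolation are illustrative assumptions, not necessarily the paper's exact warping scheme.

```python
import numpy as np

def warp_spectrogram(mel, num_segments=4, rng=None):
    """Randomly warp a mel-spectrogram (freq x time) along the time axis.

    The time axis is cut at random boundaries into `num_segments` pieces,
    and each piece is linearly resampled to a random new length (50-150%
    of its original length), so local tempo changes while the content
    order is preserved. A model pre-trained to invert this warping must
    learn a temporal mapping between input and output frames.
    """
    rng = np.random.default_rng(rng)
    n_mels, T = mel.shape
    # Random cut points partition the time axis into num_segments pieces.
    cuts = np.sort(rng.choice(np.arange(1, T), size=num_segments - 1, replace=False))
    bounds = np.concatenate(([0], cuts, [T]))
    pieces = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = mel[:, a:b]
        new_len = max(1, int(round((b - a) * rng.uniform(0.5, 1.5))))
        # Linear interpolation of frames at fractional source positions.
        src = np.linspace(0, seg.shape[1] - 1, new_len)
        idx0 = np.floor(src).astype(int)
        idx1 = np.minimum(idx0 + 1, seg.shape[1] - 1)
        frac = src - idx0
        pieces.append(seg[:, idx0] * (1 - frac) + seg[:, idx1] * frac)
    return np.concatenate(pieces, axis=1)
```

During pre-training, the warped spectrogram would serve as model input and the original as the reconstruction target, with no transcripts involved.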