Training a text-to-speech (TTS) model requires a large-scale corpus of speech paired with text transcriptions, which is costly to collect. In this paper, we propose a transfer learning framework for TTS that uses a large amount of unlabeled speech for pre-training. By leveraging wav2vec2.0 representations, the unlabeled speech substantially improves performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that a single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that a ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, thanks to pre-training on an unlabeled multi-speaker speech corpus.
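To make the pre-training signal concrete, the sketch below extracts frame-level wav2vec2.0 representations from raw, unlabeled audio. It is a minimal illustration assuming the HuggingFace transformers implementation and the facebook/wav2vec2-base checkpoint; the paper's actual pre-training pipeline and checkpoint are not specified in this abstract, so this only shows the kind of representation being leveraged.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# extract frame-level wav2vec2.0 features from unlabeled speech.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Placeholder for one second of 16 kHz mono audio; in practice,
# load waveforms from the unlabeled speech corpus.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch, frames, 768): one 768-dim vector per ~20 ms frame.
# These frame-level representations serve as the training target/input
# available without any text labels.
features = outputs.last_hidden_state
print(features.shape)
```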