Large-scale pre-trained language models have been shown to improve the naturalness of text-to-speech (TTS) models by enabling them to produce more natural prosodic patterns. However, these models usually operate at the word or sup-phoneme level and are jointly trained with phonemes, making them inefficient for the downstream TTS task, where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes in addition to the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder significantly improves the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
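To make the pretext task concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a phoneme-level encoder trained with two heads: masked phoneme prediction and prediction of the grapheme token each phoneme belongs to. All names, vocabulary sizes, and masking details here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PLBERTSketch(nn.Module):
    """Illustrative phoneme-level BERT: transformer encoder over phoneme ids
    with one head for masked phoneme prediction and one for grapheme prediction."""
    def __init__(self, n_phonemes=178, n_graphemes=30000, d_model=512,
                 n_layers=12, n_heads=8, max_len=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Two pretext heads: recover masked phonemes, and predict the
        # grapheme (word/sub-word token) aligned to each phoneme position.
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.grapheme_head = nn.Linear(d_model, n_graphemes)

    def forward(self, phoneme_ids):
        pos = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        h = self.encoder(self.phoneme_emb(phoneme_ids) + self.pos_emb(pos))
        return self.phoneme_head(h), self.grapheme_head(h)

def pretext_loss(model, masked_phonemes, target_phonemes, target_graphemes, mask):
    # `mask` marks positions whose phoneme ids were replaced by a [MASK] id;
    # the phoneme loss is taken only there, the grapheme loss over all positions.
    ph_logits, gr_logits = model(masked_phonemes)
    ce = nn.functional.cross_entropy
    ph_loss = ce(ph_logits[mask], target_phonemes[mask])
    gr_loss = ce(gr_logits.transpose(1, 2), target_graphemes)
    return ph_loss + gr_loss
```

In this sketch the encoder sees only phoneme inputs, so the downstream TTS model can reuse it without any grapheme-side computation; the grapheme head is used only during pre-training.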