Recent advances in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts the prosody of the synthesized speech remains relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech 2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned in a multi-task learning framework on three tasks: polyphone disambiguation, joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging, and prosody structure prediction (PSP). FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain. Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 improve prosody, especially for structurally complex sentences.
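The multi-task fine-tuning of the BERT front-end can be sketched as a weighted sum of per-task losses computed over a shared encoder. This is a minimal illustrative sketch, not the paper's implementation: the task names mirror the three tasks named in the abstract, but the loss values and equal weighting are assumptions.

```python
# Hypothetical sketch of multi-task loss aggregation for fine-tuning a
# shared BERT encoder on three front-end tasks: polyphone disambiguation,
# joint CWS + POS tagging, and prosody structure prediction (PSP).
# Loss values and weights below are illustrative, not from the paper.

TASKS = ("polyphone", "cws_pos", "psp")

def multi_task_loss(task_losses, task_weights):
    """Weighted sum of per-task losses backpropagated through the shared encoder."""
    return sum(task_weights[t] * task_losses[t] for t in TASKS)

# Example step: equal weighting across the three tasks.
losses = {"polyphone": 0.42, "cws_pos": 0.31, "psp": 0.55}
weights = {t: 1.0 for t in TASKS}
total = multi_task_loss(losses, weights)
print(round(total, 2))  # 1.28
```

In practice each task would contribute its own head (a token-level classifier for polyphones and tagging, a sequence labeler for prosodic boundaries) on top of the shared BERT representations; how the weights are chosen or scheduled is a design decision the abstract does not specify.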