Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, these works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experimental results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with a 0.30 CMOS gain compared with the FastSpeech 2 baseline. Moreover, Mixed-Phoneme BERT achieves a 3x inference speedup with similar voice quality to the previous TTS pre-trained model PnG BERT.
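To illustrate how phoneme and sup-phoneme sequences can be combined as a single encoder input, the following is a minimal sketch. It assumes that each sup-phoneme id is repeated over the phoneme positions it covers and that the two embeddings are summed position-wise before being fed to the BERT encoder; the class and parameter names (MixedPhonemeEmbedding, num_sup_phonemes, etc.) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class MixedPhonemeEmbedding(nn.Module):
    """Sketch of a mixed phoneme / sup-phoneme input layer (names are illustrative).

    Each phoneme position receives the sum of its phoneme embedding and the
    embedding of the sup-phoneme (merged phoneme group) covering it, so the
    downstream BERT encoder sees both granularities at once.
    """

    def __init__(self, num_phonemes, num_sup_phonemes, hidden_size):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, hidden_size)
        self.sup_phoneme_emb = nn.Embedding(num_sup_phonemes, hidden_size)

    def forward(self, phoneme_ids, sup_phoneme_ids):
        # phoneme_ids:     (batch, seq_len) phoneme token ids
        # sup_phoneme_ids: (batch, seq_len) id of the sup-phoneme covering each
        #                  phoneme position, repeated across its span
        return self.phoneme_emb(phoneme_ids) + self.sup_phoneme_emb(sup_phoneme_ids)


# Toy usage: four phonemes merged into one sup-phoneme spanning all positions.
layer = MixedPhonemeEmbedding(num_phonemes=100, num_sup_phonemes=5000, hidden_size=768)
phoneme_ids = torch.tensor([[11, 3, 27, 40]])
sup_phoneme_ids = torch.tensor([[812, 812, 812, 812]])   # same sup-phoneme id over its span
mixed_input = layer(phoneme_ids, sup_phoneme_ids)         # (1, 4, 768), fed to the BERT encoder
```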