This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using a PnG BERT and ground truth recordings from professional speakers.
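To make the input representation described above concrete, the following is a minimal sketch (not the authors' implementation) of how a PnG BERT input could be assembled: phoneme and grapheme token sequences are concatenated as two segments, and each token carries a word-level position so the two representations of the same word are aligned. The function name, token conventions, and toy phoneme/grapheme decompositions below are illustrative assumptions.

```python
from typing import List, Tuple

def build_png_bert_input(
    words: List[Tuple[List[str], List[str]]],
) -> Tuple[List[str], List[int], List[int]]:
    """Each item in `words` is (phonemes, graphemes) for one word.

    Returns (tokens, segment_ids, word_positions):
      - tokens: [CLS] + all phonemes + [SEP] + all graphemes + [SEP]
      - segment_ids: 0 for the phoneme segment, 1 for the grapheme segment
      - word_positions: index of the word each token belongs to, shared across
        both segments, providing the word-level alignment between them.
    """
    tokens, segments, word_pos = ["[CLS]"], [0], [0]

    # Phoneme segment: one entry per phoneme, tagged with its word index.
    for w_idx, (phonemes, _) in enumerate(words, start=1):
        for p in phonemes:
            tokens.append(p)
            segments.append(0)
            word_pos.append(w_idx)
    tokens.append("[SEP]"); segments.append(0); word_pos.append(0)

    # Grapheme segment: same word indices as above, so the model can relate
    # a word's spelling to its pronunciation.
    for w_idx, (_, graphemes) in enumerate(words, start=1):
        for g in graphemes:
            tokens.append(g)
            segments.append(1)
            word_pos.append(w_idx)
    tokens.append("[SEP]"); segments.append(1); word_pos.append(0)

    return tokens, segments, word_pos


if __name__ == "__main__":
    # "hello world" with toy phoneme/grapheme decompositions (hypothetical).
    example = [(["HH", "AH", "L", "OW"], list("hello")),
               (["W", "ER", "L", "D"], list("world"))]
    toks, segs, pos = build_png_bert_input(example)
    print(list(zip(toks, segs, pos)))
```

The shared word positions are what allow self-attention to tie a grapheme token to the phonemes of the same word during pre-training and TTS fine-tuning.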