Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-speaker speech corpus. To this end, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus, injecting speaker embedding to capture various speaker characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves conventional phrasing models on the position prediction of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction considering contextual information and is used to demonstrate the effect of speaker information on the prediction. The second model is further designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs) that are categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
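The speaker-conditioned prediction described above can be sketched as a small classification head: contextual token representations (e.g., from BERT) are fused with a learned speaker embedding and mapped to per-token pause labels. This is a minimal illustration, not the paper's implementation; the dimensions, the number of pause classes (e.g., no pause, short/long RP, PIP), and the concatenation-based fusion are all assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditionedPausePredictor(nn.Module):
    """Hypothetical sketch of speaker-conditioned pause-label prediction.

    Contextual token embeddings (assumed to come from a pre-trained
    encoder such as BERT) are concatenated with a speaker embedding,
    then classified into pause categories per token.
    """

    def __init__(self, hidden_dim=768, num_speakers=10,
                 spk_dim=64, num_labels=4):
        super().__init__()
        # One learned vector per speaker, capturing pausing style.
        self.speaker_emb = nn.Embedding(num_speakers, spk_dim)
        # Linear head over [token representation ; speaker embedding].
        self.classifier = nn.Linear(hidden_dim + spk_dim, num_labels)

    def forward(self, token_reprs, speaker_ids):
        # token_reprs: (batch, seq_len, hidden_dim) encoder outputs
        # speaker_ids: (batch,) integer speaker indices
        spk = self.speaker_emb(speaker_ids)                    # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, token_reprs.size(1), -1)
        fused = torch.cat([token_reprs, spk], dim=-1)
        return self.classifier(fused)                          # (batch, seq_len, num_labels)
```

In use, the same sentence can receive different pause labels for different `speaker_ids`, which is the point of injecting speaker information into the phrasing model.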