We propose a training method for spontaneous speech synthesis models that guarantees the consistency of the linguistic parts of synthesized speech. Personalized spontaneous speech synthesis aims to reproduce speaker-specific disfluencies such as filled pauses. Our prior model includes a filled-pause prediction model and synthesizes speech containing filled pauses from text that has none. However, inserting filled pauses degrades the quality of the linguistic parts of the synthesized speech. A likely cause is that filled-pause insertion tendencies differ between training and inference, so the synthesis model cannot represent the connections between filled pauses and their surrounding phonemes at inference time. We therefore developed linguistic-speech consistency training, which guarantees that the linguistic parts of synthetic speech are consistent with and without filled pauses. The proposed consistency training uses not only ground-truth filled pauses but also pseudo ones. Our experiments demonstrate that this method improves the naturalness of both the synthetic linguistic speech and the entire synthetic speech with predicted filled pauses.
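As a rough illustration of the consistency idea, the sketch below compares the linguistic frames of two syntheses of the same sentence, one with filled pauses and one without. It is not the authors' implementation: the function name `linguistic_consistency_loss`, the frame-level filled-pause mask, the mean-absolute-error distance, and the assumption that the linguistic frames of both outputs are already time-aligned one to one are all hypothetical choices made for this example.

```python
from typing import Sequence


def linguistic_consistency_loss(
    frames_with_fp: Sequence[Sequence[float]],
    is_fp_frame: Sequence[bool],
    frames_without_fp: Sequence[Sequence[float]],
) -> float:
    """Mean absolute difference between the linguistic frames of two syntheses.

    frames_with_fp:    acoustic frames synthesized from text WITH filled pauses
    is_fp_frame:       True for every frame that belongs to a filled pause
    frames_without_fp: acoustic frames synthesized from the same text WITHOUT
                       filled pauses

    Assumption (hypothetical): once filled-pause frames are dropped, the two
    sequences have the same length and are aligned frame by frame.
    """
    if len(frames_with_fp) != len(is_fp_frame):
        raise ValueError("mask length must match the number of frames")

    # Keep only the frames that carry linguistic content.
    linguistic = [f for f, fp in zip(frames_with_fp, is_fp_frame) if not fp]
    if len(linguistic) != len(frames_without_fp):
        raise ValueError("linguistic frames are not aligned one to one")

    total, count = 0.0, 0
    for frame_a, frame_b in zip(linguistic, frames_without_fp):
        if len(frame_a) != len(frame_b):
            raise ValueError("feature dimensions differ")
        for a, b in zip(frame_a, frame_b):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("no linguistic frames to compare")
    return total / count


# Toy example: three frames with one filled-pause frame in the middle.
with_fp = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]]
fp_mask = [False, True, False]
without_fp = [[1.0, 2.0], [3.0, 5.0]]
loss = linguistic_consistency_loss(with_fp, fp_mask, without_fp)
```

In training, a term of this kind would presumably be added to the usual synthesis loss, and the filled-pause positions could come either from ground-truth annotations or from pseudo insertions, as the abstract describes. How the pseudo positions are sampled is not stated here, so the sketch leaves that step out.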