Prosody modeling is an essential component of modern text-to-speech (TTS) frameworks. Explicitly providing prosody features to the TTS model makes it possible to control the style of synthesized utterances. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings and propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features. The proposed method outperforms competing methods in terms of audio quality and prosody naturalness in our objective and subjective evaluations.
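The hierarchical conditioning described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the network layers, feature dimensions, or how word-level features are aligned to phonemes, so everything below is an assumption made for illustration: the function names are hypothetical, word-level prosody is simply repeated for each phoneme of the word, and a plain linear map stands in for the learned phoneme-level predictor.

```python
def expand_word_to_phoneme(word_feats, phonemes_per_word):
    """Repeat each word-level feature vector once per phoneme in that word.

    Assumption: alignment is given as a phoneme count per word.
    """
    expanded = []
    for feat, count in zip(word_feats, phonemes_per_word):
        expanded.extend([list(feat) for _ in range(count)])
    return expanded


def linear(x, weights, bias):
    """Apply y = W x + b to a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_phoneme_prosody(phoneme_hidden, word_prosody,
                            phonemes_per_word, weights, bias):
    """Predict phoneme-level prosody conditioned on word-level prosody.

    Each phoneme's hidden vector is concatenated with the prosody vector
    of the word it belongs to; a linear layer (a stand-in for the real,
    unspecified predictor) maps the result to phoneme-level prosody.
    """
    word_cond = expand_word_to_phoneme(word_prosody, phonemes_per_word)
    if len(word_cond) != len(phoneme_hidden):
        raise ValueError("phoneme counts do not match the hidden sequence length")
    return [linear(h + c, weights, bias)
            for h, c in zip(phoneme_hidden, word_cond)]


# Toy input: two words with 2 and 1 phonemes, hidden size 2, prosody size 1.
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word_prosody = [[1.0], [2.0]]
phonemes_per_word = [2, 1]
# Weights that ignore the hidden state and copy the word-level value,
# chosen only to make the conditioning path easy to inspect.
weights = [[0.0, 0.0, 1.0]]
bias = [0.0]

result = predict_phoneme_prosody(phoneme_hidden, word_prosody,
                                 phonemes_per_word, weights, bias)
```

In a trained system the weights would be learned and the predictor would be a neural module; the sketch only shows where the word-level features enter the phoneme-level prediction.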