Recent expressive text-to-speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS, which aims to better transfer and control intonation so as to convey the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor that extracts style embeddings at two levels: the sentence level represents emotion, and the final-syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments validate the effectiveness of QI-TTS in improving intonation expressiveness in emotional speech synthesis.
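To illustrate how relative attributes can turn an ordering between two classes into a continuous intensity value, the following is a minimal sketch rather than the QI-TTS implementation. It assumes a linear ranking function trained with a pairwise hinge loss (in the spirit of RankSVM) and synthetic 4-dimensional syllable features. The feature layout, the function names `fit_relative_attribute` and `intensity`, and all hyperparameters are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_relative_attribute(strong, weak, epochs=200, lr=0.1, reg=1e-3):
    """Learn w so that w @ s > w @ k for each strong sample s and weak sample k."""
    # Difference vectors for every (strong, weak) pair.
    diffs = (strong[:, None, :] - weak[None, :, :]).reshape(-1, strong.shape[1])
    w = np.zeros(strong.shape[1])
    for _ in range(epochs):
        margins = diffs @ w
        violated = diffs[margins < 1.0]  # pairs still inside the hinge margin
        # Gradient of: reg/2 * ||w||^2 + mean(max(0, 1 - diffs @ w))
        grad = reg * w - violated.sum(axis=0) / len(diffs)
        w -= lr * grad
    return w

def intensity(features, w, lo, hi):
    """Map ranking scores to [0, 1] using the score range seen in training."""
    scores = features @ w
    return np.clip((scores - lo) / (hi - lo), 0.0, 1.0)

rng = np.random.default_rng(0)
# Assumption: dimension 0 stands in for a final-syllable pitch rise.
neutral = rng.normal(0.0, 1.0, size=(50, 4))
question = rng.normal(0.0, 1.0, size=(50, 4))
question[:, 0] += 3.0

w = fit_relative_attribute(question, neutral)
train_scores = np.vstack([question, neutral]) @ w
lo, hi = train_scores.min(), train_scores.max()

q_int = intensity(question, w, lo, hi)
n_int = intensity(neutral, w, lo, hi)
print("mean question intensity:", q_int.mean())
print("mean neutral intensity:", n_int.mean())
```

In a setup like this, the normalized score could serve as a scalar that scales a syllable-level intonation embedding at inference time. How QI-TTS actually consumes the intensity value is not specified in the abstract.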