We propose UniTTS, a novel high-fidelity expressive speech synthesis model that learns and controls overlapping style attributes without interference. UniTTS represents multiple style attributes in a single unified embedding space as the residuals between the phoneme embeddings before and after each attribute is applied. The proposed method is especially effective for controlling attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding speaker-ID and emotion variation and, in addition, predicts duration, pitch, and energy conditioned on speaker ID and emotion. In our experiments, visualizations show that the proposed method learns multiple attributes harmoniously, in a manner that allows them to be separated again easily, and UniTTS synthesizes high-fidelity speech while controlling multiple style attributes. Synthesized speech samples are available at https://jackson-kang.github.io/paper_works/UniTTS/demos.
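As a rough illustration of the residual-based representation described above (a sketch under our own assumptions, not the authors' implementation), the following PyTorch snippet shows one way a style attribute such as speaker ID or emotion could be encoded as the residual between phoneme embeddings before and after conditioning. The class name `ResidualAttributeEncoder` and the embedding-table conditioner are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualAttributeEncoder(nn.Module):
    """Hypothetical sketch: a style attribute (e.g. speaker ID or emotion) is
    represented as the residual between phoneme embeddings before and after
    the attribute is applied, so all attributes share one embedding space."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Assumed per-attribute lookup table producing one vector per attribute value.
        self.attr_table = nn.Embedding(num_classes, dim)

    def forward(self, phoneme_emb: torch.Tensor, attr_id: torch.Tensor):
        # phoneme_emb: (batch, num_phonemes, dim); attr_id: (batch,)
        residual = self.attr_table(attr_id).unsqueeze(1)   # (batch, 1, dim)
        conditioned = phoneme_emb + residual               # embeddings "after applying" the attribute
        # Subtracting the original embeddings recovers the attribute representation.
        return conditioned, conditioned - phoneme_emb

# Usage example with dummy inputs
enc = ResidualAttributeEncoder(num_classes=10, dim=256)
phonemes = torch.randn(2, 50, 256)          # 2 utterances, 50 phonemes each
emotion_ids = torch.tensor([0, 3])          # one attribute value per utterance
conditioned, residual = enc(phonemes, emotion_ids)
```

Because every attribute is expressed as a residual in the same space, multiple attributes (e.g. speaker ID and emotion) could in principle be added and later disentangled by subtraction, which is the intuition behind the unified embedding space.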