Disentanglement of a speaker's timbre and style is crucial for style transfer in multi-speaker, multi-style text-to-speech (TTS) scenarios. With timbre and style disentangled, a TTS system can synthesize expressive speech for a given speaker in any style seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings: existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or rely on complex networks and complicated training procedures that are hard to reproduce and offer little control over style transfer behavior. To improve the disentanglement of timbre and style, and to remove the reliance on single-speaker multi-style corpora, this paper proposes a simple but effective disentanglement method. FastSpeech2 is employed as the backbone network, with explicit duration, pitch, and energy trajectories representing the style. Each speaker's data is treated as a separate, isolated style, and a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization is applied to further improve the decoupling effect. Experimental results demonstrate that the proposed model can synthesize speech in any style seen during training with high style similarity while maintaining very high speaker similarity.
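The abstract does not specify how the utterance-level pitch and energy normalization is computed. A minimal sketch, assuming per-utterance z-score normalization with pitch statistics taken over voiced frames only (unvoiced frames conventionally encoded as zero); the function name and `eps` parameter are illustrative, not from the paper:

```python
import numpy as np

def normalize_utterance(pitch, energy, eps=1e-8):
    """Per-utterance z-score normalization of pitch and energy contours.

    Assumption: unvoiced frames carry pitch == 0, so pitch statistics
    are computed over voiced frames only and unvoiced frames stay 0.
    """
    voiced = pitch > 0
    if voiced.any():
        mean, std = pitch[voiced].mean(), pitch[voiced].std()
        # Normalize voiced frames; keep unvoiced frames at 0.
        pitch = np.where(voiced, (pitch - mean) / (std + eps), 0.0)
    # Energy is normalized over all frames of the utterance.
    energy = (energy - energy.mean()) / (energy.std() + eps)
    return pitch, energy
```

Normalizing per utterance removes speaker-dependent absolute pitch and loudness levels, so the style embedding can focus on relative prosodic contours rather than memorizing a speaker's average F0.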