This paper presents a novel neural network system design for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. Objective and subjective evaluation results show that our system outperforms other fine-grained speech style transfer models, especially in terms of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for demonstration at https://daxintan-cuhk.github.io/pl-csd-speech.
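To make the two central ideas of the abstract concrete, the following is a minimal PyTorch sketch (not the authors' implementation; all module names, dimensions, and the phone inventory size are illustrative assumptions): a phone-level style encoder that maps the mel-spectrogram frames of one phone segment to a style embedding, combined with a gradient-reversal layer feeding an adversarial phone classifier so that the style embedding is discouraged from carrying content information.

```python
# Hedged sketch of phone-level style encoding with adversarial content
# disentanglement. Not the paper's code; shapes and hyperparameters are assumed.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class PhoneLevelStyleEncoder(nn.Module):
    """Encodes the mel frames of a single phone segment into a style embedding."""
    def __init__(self, n_mels=80, style_dim=16, n_phones=72, grl_lambda=1.0):
        super().__init__()
        self.grl_lambda = grl_lambda
        self.rnn = nn.GRU(n_mels, 128, batch_first=True, bidirectional=True)
        self.to_style = nn.Linear(256, style_dim)
        # Adversarial phone classifier: trained to predict phone identity from
        # the style embedding, while the reversed gradient pushes the encoder
        # to remove that information (mitigating "content leakage").
        self.phone_classifier = nn.Sequential(
            nn.Linear(style_dim, 128), nn.ReLU(), nn.Linear(128, n_phones)
        )

    def forward(self, mel_segment):
        # mel_segment: (batch, frames, n_mels) -- frames of one phone segment
        _, h = self.rnn(mel_segment)              # h: (2, batch, 128)
        h = torch.cat([h[0], h[1]], dim=-1)       # (batch, 256)
        style = self.to_style(h)                  # (batch, style_dim)
        phone_logits = self.phone_classifier(
            GradReverse.apply(style, self.grl_lambda))
        return style, phone_logits


if __name__ == "__main__":
    enc = PhoneLevelStyleEncoder()
    mel = torch.randn(4, 25, 80)              # 4 phone segments, 25 frames each
    style, phone_logits = enc(mel)
    print(style.shape, phone_logits.shape)    # (4, 16) and (4, 72)
```

In this sketch the adversarial classification loss on `phone_logits` stands in for the paper's adversarial learning strategy; the collaborative learning component and the style predictor used for TTS are not shown.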