Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at \url{https://GenerSpeech.github.io/}
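To make the Mix-Style Layer Normalization idea concrete, below is a minimal PyTorch sketch of a style-conditioned layer norm that, during training, mixes style embeddings across the batch so the content representation cannot latch onto any single style. The class and parameter names (MixStyleLayerNorm, style_dim, alpha) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MixStyleLayerNorm(nn.Module):
    """Layer norm whose scale/shift come from a style embedding; during
    training, style embeddings are interpolated across the batch
    (a hypothetical sketch of Mix-Style Layer Normalization)."""

    def __init__(self, hidden_dim: int, style_dim: int, alpha: float = 0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_shift = nn.Linear(style_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            # Mix each style vector with that of a randomly chosen sample.
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            perm = torch.randperm(style.size(0), device=style.device)
            style = lam * style + (1.0 - lam) * style[perm]
        scale = self.to_scale(style).unsqueeze(1)  # (batch, 1, hidden_dim)
        shift = self.to_shift(style).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + shift
```

Perturbing the conditioning style in this way acts as a regularizer: the linguistic content pathway sees many plausible style combinations and thus generalizes better to unseen (OOD) styles at inference time.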