One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker using only a few seconds of reference speech from the target speaker. This relies heavily on disentangling speaker identity from speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through a teacher-student knowledge transfer and novel data augmentation scheme, our approach yields disentangled speech representations without requiring the input text. Subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
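To make the teacher-student idea concrete, the following is a minimal, hypothetical sketch of how a mel-spectrogram "student" encoder could be distilled from a frozen text-based "teacher" encoder of a style-based TTS model, so that no transcription is needed at inference time. The module names, dimensions, and the plain L1 matching loss are illustrative assumptions, not the paper's actual architecture or objective.

```python
# Hedged sketch: distill a mel-spectrogram encoder (student) to reproduce the
# latent representations of a frozen text-based TTS encoder (teacher).
# All names, shapes, and the L1 objective are assumptions for illustration.
import torch
import torch.nn as nn

class MelStudentEncoder(nn.Module):
    """Encodes an 80-bin mel-spectrogram into frame-level latents."""
    def __init__(self, n_mels: int = 80, d_latent: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, d_latent, kernel_size=5, padding=2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> latents: (batch, d_latent, frames)
        return self.net(mel)

def distillation_step(student, teacher_latents, mel, optimizer):
    """One knowledge-transfer step: match the frozen teacher's latents."""
    optimizer.zero_grad()
    pred = student(mel)
    loss = nn.functional.l1_loss(pred, teacher_latents)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    student = MelStudentEncoder()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    mel = torch.randn(4, 80, 120)               # dummy mel-spectrogram batch
    teacher_latents = torch.randn(4, 256, 120)  # stand-in for TTS text-encoder output
    print(distillation_step(student, teacher_latents, mel, opt))
```

In practice, the teacher latents would come from the pretrained style-based TTS text encoder run on ground-truth transcriptions (optionally with the data augmentation scheme mentioned above), while the student consumes only the mel-spectrogram, which is what enables text-free conversion at test time.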