Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which we randomly mask the spectrogram and the phonemes, given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows significant improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, without any finetuning effort. Our experiments on cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing show that our model outperforms speaker-embedding-based multi-speaker TTS methods.
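The joint masking scheme described above can be illustrated with a minimal sketch. This is not the paper's implementation: the masking probability, the mask values, and the `mask_spans` helper are all illustrative assumptions; in practice the model would mask contiguous spans and reconstruct them with a Transformer-style encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(x, mask_prob=0.15, mask_value=0):
    """Randomly mask positions of a sequence (hypothetical helper).

    x: (T, D) spectrogram frames or (T,) phoneme-id sequence.
    Returns the masked copy and the boolean mask used.
    """
    T = x.shape[0]
    mask = rng.random(T) < mask_prob
    x_masked = np.array(x, copy=True)
    x_masked[mask] = mask_value  # masks whole frames/tokens at masked positions
    return x_masked, mask

# Toy inputs: an 80-bin mel spectrogram and its phoneme transcription.
spec = rng.standard_normal((100, 80))
phones = rng.integers(1, 50, size=100)

spec_masked, spec_mask = mask_spans(spec, mask_value=0.0)
phone_masked, phone_mask = mask_spans(phones, mask_value=0)  # 0 = assumed <mask> id

# Pretraining objective (sketch): reconstruct only the masked positions,
# e.g. a regression loss on masked spectrogram frames and a
# cross-entropy loss on masked phoneme ids.
```

The key design point the abstract implies is that both modalities are masked jointly, so the model must use unmasked phonemes to reconstruct masked speech frames and vice versa, which is what transfers across languages and speakers.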