Speech representation learning has improved both speech understanding and speech synthesis tasks within a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining approach to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which, given a speech example and its transcription, we randomly mask the spectrogram and the phonemes. By learning to reconstruct the masked parts of the input in different languages, our model achieves substantial improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, requiring no fine-tuning. Experiments on cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing confirm that our model outperforms speaker-embedding-based multi-speaker TTS methods on both tasks. The code and model are publicly available in PaddleSpeech.
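To make the masking step of the joint pretraining objective concrete, the sketch below shows one plausible way to randomly mask spectrogram frames and phoneme tokens so the model can learn to reconstruct them. The function name, mask ratios, frame-level (rather than span-level) masking, and the constant mask value are all illustrative assumptions, not the exact scheme used in the paper or in PaddleSpeech.

```python
import numpy as np

def mask_speech_and_text(spectrogram, phonemes,
                         spec_mask_ratio=0.15, phone_mask_ratio=0.15,
                         mask_value=0.0, mask_token="<MASK>", rng=None):
    """Hypothetical masking for speech-text joint pretraining.

    spectrogram: (T, n_mels) float array of mel-spectrogram frames.
    phonemes:    list of phoneme symbols from the transcription.

    Returns the masked inputs plus boolean masks marking the
    positions the model is trained to reconstruct.
    """
    rng = rng or np.random.default_rng()

    # Randomly select spectrogram frames and overwrite them with a
    # constant; a real system might instead mask contiguous spans or
    # use a learned mask embedding.
    spec_mask = rng.random(spectrogram.shape[0]) < spec_mask_ratio
    masked_spec = spectrogram.copy()
    masked_spec[spec_mask] = mask_value

    # Randomly replace phoneme tokens with a mask symbol.
    phone_mask = rng.random(len(phonemes)) < phone_mask_ratio
    masked_phones = [mask_token if m else p
                     for p, m in zip(phonemes, phone_mask)]

    return masked_spec, masked_phones, spec_mask, phone_mask
```

The reconstruction loss would then be computed only at the masked positions indicated by `spec_mask` and `phone_mask`, which is what lets the same pretrained model serve both voice cloning (reconstructing masked speech from text) and speech editing (reconstructing a masked region inside an utterance).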