The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances whose voice and speaking style match those of a reference speaker, given only a few reference samples. In this work, we investigate different speaker representations and propose integrating pretrained and learnable speaker representations. Among the embedding types compared, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations generalizes well to few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
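The combination of pretrained and learnable speaker representations described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: the dimensions, the concatenate-then-project design, and all names (`learnable_table`, `speaker_representation`, etc.) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 256-d pretrained embedding (e.g. from a
# voice conversion encoder) and a 64-d learnable per-speaker embedding,
# projected to the model dimension of the TTS decoder.
PRETRAINED_DIM, LEARNABLE_DIM, MODEL_DIM, NUM_SPEAKERS = 256, 64, 256, 10

# Learnable per-speaker lookup table (trained jointly with the TTS model
# in practice; random-initialized here for illustration).
learnable_table = rng.normal(size=(NUM_SPEAKERS, LEARNABLE_DIM)) * 0.01
# Projection mapping the concatenated embedding to the model dimension.
proj = rng.normal(size=(PRETRAINED_DIM + LEARNABLE_DIM, MODEL_DIM)) * 0.01

def speaker_representation(pretrained_emb, speaker_id):
    """Concatenate the frozen pretrained embedding with the learnable
    embedding for the given speaker, then project to the model dim."""
    combined = np.concatenate([pretrained_emb, learnable_table[speaker_id]])
    return combined @ proj

# One conditioning vector per utterance, fed to the acoustic model.
emb = speaker_representation(rng.normal(size=PRETRAINED_DIM), speaker_id=3)
print(emb.shape)  # (256,)
```

For a few-shot speaker, the pretrained branch supplies a representation extracted directly from the reference audio, while the learnable branch is fine-tuned on the handful of reference samples.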