One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes content information to leak into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, achieving proper disentanglement of content, speaker and pitch representations by reducing their inter-dependencies in an unsupervised manner. Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations that retain the source linguistic content and intonation variations while capturing the target speaker's characteristics. As a result, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.
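To make the two training ingredients named above concrete, the sketch below (PyTorch) illustrates them under stated assumptions: a standard VQ-VAE quantizer with straight-through gradients for content encoding, and a CLUB-style variational upper bound as one common unsupervised MI estimator. This is a minimal illustration, not the authors' implementation; all class names, dimensions and hyper-parameters are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of the two ingredients
# the abstract names: VQ for content encoding and an MI penalty between
# representations. The MI term is written as a CLUB-style variational
# upper bound; names and hyper-parameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):  # z: (batch, frames, dim) content-encoder output
        flat = z.reshape(-1, z.size(-1))
        # Squared distance from each frame to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Codebook + commitment losses (standard VQ-VAE objective).
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through estimator: copy gradients from q back to z.
        q = z + (q - z).detach()
        return q, vq_loss


class CLUBEstimator(nn.Module):
    """CLUB-style MI upper bound between two embeddings.

    A small network q(y|x) models one embedding from the other; the gap
    between positive-pair and shuffled-pair log-likelihoods upper-bounds
    I(x; y). The main encoders minimize this gap, while the estimator
    itself is trained separately by maximizing loglik on positive pairs.
    """

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def loglik(self, x, y):
        # Gaussian log-likelihood of y under q(y|x), up to constants.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu).pow(2) / logvar.exp() - logvar).sum(-1).mean()

    def mi_upper_bound(self, x, y):  # minimized w.r.t. the encoders
        pos = self.loglik(x, y)                          # matched pairs
        neg = self.loglik(x, y[torch.randperm(y.size(0))])  # shuffled pairs
        return pos - neg
```

In a full training loop, the content encoder's output would pass through the quantizer, and the MI bound would be evaluated pairwise between the content, speaker and pitch embeddings, with each bound added to the reconstruction and VQ losses so that gradients drive the inter-dependencies down.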