One-shot voice conversion (VC), which uses only a single utterance from the target speaker as reference, has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm, and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we employ random resampling for the pitch and content encoders, and we use the variational contrastive log-ratio upper bound of mutual information together with gradient-reversal-layer-based adversarial mutual information learning to ensure that the different parts of the latent space contain only the desired disentangled representations during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, through speech representation disentanglement, we can transfer timbre, pitch, and rhythm characteristics separately in one-shot VC. Our code, pre-trained models, and demo are available at https://im1eon.github.io/IS2022-SRDVC/.
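The random resampling mentioned above is commonly realized by splitting a feature sequence into fixed-length segments and stretching or squeezing each segment by a random factor, which perturbs rhythm so the encoder cannot rely on it. A minimal NumPy sketch of this idea (segment length, ratio range, and function name are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def random_resample(features, seg_len=32, ratios=(0.5, 1.5), rng=None):
    """Randomly time-stretch each fixed-length segment of a 1-D or 2-D
    feature sequence via linear interpolation.

    Hypothetical sketch: segment length and ratio range are assumed,
    not taken from the paper.
    """
    rng = rng or np.random.default_rng()
    out = []
    for start in range(0, len(features), seg_len):
        seg = features[start:start + seg_len]
        r = rng.uniform(*ratios)                       # random stretch factor
        new_len = max(1, int(round(len(seg) * r)))     # new segment length
        old_x = np.linspace(0.0, 1.0, num=len(seg))
        new_x = np.linspace(0.0, 1.0, num=new_len)
        if seg.ndim == 1:
            out.append(np.interp(new_x, old_x, seg))
        else:
            # Interpolate each feature dimension independently.
            out.append(np.stack(
                [np.interp(new_x, old_x, seg[:, d]) for d in range(seg.shape[1])],
                axis=1))
    return np.concatenate(out, axis=0)
```

Because each segment is rescaled by an independent random factor, the output length varies from call to call, destroying absolute duration cues while preserving local content.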