Non-parallel voice conversion aims to convert voice from a source domain to a target domain without paired training data. Cycle-Consistent Generative Adversarial Networks (CycleGANs) and Variational Autoencoders (VAEs) have been applied to this task, but both are difficult to train and produce unsatisfactory results. Contrastive Voice Conversion (CVC) was later introduced, using a contrastive learning-based approach to address these issues. However, these methods rely on CNN-based generators, which capture local semantics well but lack the ability to model the long-range dependencies needed for global semantics. In this paper, we propose VCTR, an efficient method for non-parallel voice conversion that combines a Hybrid Perception Block (HPB) and Dual Pruned Self-Attention (DPSA) with a contrastive learning-based adversarial approach. The code is available at https://github.com/Maharnab-Saikia/VCTR.
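The local-versus-global distinction above can be made concrete with a minimal NumPy sketch (this is a generic illustration, not the paper's DPSA): plain scaled dot-product self-attention lets every output frame attend to every input frame, whereas a small-kernel convolution only sees a local window, so a perturbation in one frame propagates everywhere under attention but only to neighboring frames under convolution.

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, dim). Plain scaled dot-product self-attention with
    # projection weights omitted for brevity: every output position
    # attends to every input position (global receptive field).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def conv1d(x, kernel_size=3):
    # Depthwise 1-D convolution with an averaging kernel: each output
    # position sees only a local window of size kernel_size.
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[i:i + kernel_size].mean(axis=0)
                     for i in range(x.shape[0])])

x = np.zeros((8, 4))
x[0] = 1.0  # perturb only the first frame

# The perturbation reaches every attention output position ...
assert np.all(np.abs(self_attention(x)).sum(axis=-1) > 0)
# ... but only the first kernel_size // 2 + 1 convolution outputs.
assert np.abs(conv1d(x)[2:]).sum() == 0
```

Stacking many convolution layers widens the receptive field linearly, which is why purely CNN-based generators struggle to relate distant frames that a single attention layer connects directly.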