Nonparallel multi-domain voice conversion methods such as the StarGAN-VC family have been widely applied in many scenarios. However, training these models is often challenging because of their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, SimSiam-StarGAN-VC, improves training stability and effectively prevents the discriminator from overfitting during training. We conduct experiments on the Voice Conversion Challenge 2018 (VCC 2018) dataset, together with a user study, to validate the performance of our framework. Our experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods on both objective and subjective metrics.
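The abstract does not spell out how the Siamese structure is wired into the discriminator. As a rough illustration of the SimSiam-style objective that the method's name refers to, the sketch below computes the symmetrized negative cosine similarity between two views of the same input. The names `p1`, `p2` (predictor outputs) and `z1`, `z2` (projected features) are hypothetical, and how the two views are formed from speech features is an assumption, not something stated in the source. Plain Python cannot express the stop-gradient that SimSiam applies to `z1` and `z2`, so it is only noted in the comments.

```python
import math
from typing import Sequence


def negative_cosine_similarity(p: Sequence[float], z: Sequence[float]) -> float:
    """Return -cos(p, z).

    In SimSiam, z is treated as a constant (stop-gradient) during
    backpropagation; this sketch only evaluates the forward value.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def simsiam_loss(
    p1: Sequence[float],
    z1: Sequence[float],
    p2: Sequence[float],
    z2: Sequence[float],
) -> float:
    """Symmetrized SimSiam loss: each predictor output is compared
    with the projected feature of the *other* view."""
    return 0.5 * negative_cosine_similarity(p1, z2) + 0.5 * negative_cosine_similarity(
        p2, z1
    )
```

The loss reaches its minimum of -1 when each predictor output points in the same direction as the other view's feature, and is 0 when they are orthogonal. In an actual training setup these vectors would come from a shared (Siamese) encoder, presumably the discriminator's feature extractor in this work, though that detail is an assumption here.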