Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive topic due to its usefulness in real use-case scenarios. Recent work in this area has made progress with disentanglement methods that separate utterance content from speaker characteristics. Although crucial, extracting disentangled prosody characteristics for unseen speakers remains an open problem. In this paper, we propose a novel self-supervised approach to effectively learn prosody characteristics. We then use the learned prosodic representations to train our VC model for zero-shot conversion. Our evaluation demonstrates that we can efficiently extract disentangled prosody representations. Moreover, we show improved performance compared to state-of-the-art zero-shot VC models.
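To make the notion of a speaker-disentangled prosody feature concrete, the sketch below shows one common baseline (not the method proposed in this paper): extracting an F0 contour with pYIN and z-normalizing it per utterance, which removes the speaker's absolute pitch level and range while keeping the relative intonation pattern. The function name `normalized_log_f0` and the sampling rate are illustrative assumptions; only `librosa.pyin` is a real library call.

```python
# A minimal sketch, assuming librosa is available. This is an illustrative
# baseline prosody feature, not the paper's self-supervised representation.
import numpy as np
import librosa

def normalized_log_f0(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a per-utterance z-normalized log-F0 contour.

    Normalizing within the utterance removes speaker-dependent pitch
    statistics, leaving a crude speaker-independent prosody representation.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    # pYIN returns F0 per frame (NaN where unvoiced) plus a voicing flag.
    f0, voiced, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    log_f0 = np.log(np.where(voiced, f0, np.nan))
    mean, std = np.nanmean(log_f0), np.nanstd(log_f0)
    norm = (log_f0 - mean) / (std + 1e-8)
    return np.nan_to_num(norm)  # map unvoiced frames to 0
```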