Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications such as voice customization and animation production. Recent work in this area has made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods suffer from prosody leakage (e.g., pitch, volume), causing the speaker identity in the synthesized speech to differ from the desired target speaker. To address this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations capturing the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that adding our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC performance.
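For illustration only, the following is a minimal sketch of how learned pitch and volume embeddings could condition a VC decoder alongside a speaker embedding. It is not the authors' architecture; the module names, dimensions, and the choice of GRU-based encoders are assumptions made for the example.

# Hypothetical sketch: separate pitch/volume encoders produce prosody embeddings
# that, together with a speaker embedding, condition a decoder over content features.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Encodes a frame-level prosody contour (e.g., log-F0 or energy) into one embedding."""
    def __init__(self, in_dim=1, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, contour):                # contour: (B, T, 1)
        _, h = self.rnn(contour)               # h: (1, B, hidden)
        return self.proj(h[-1])                # (B, emb_dim)

class ConditionedDecoder(nn.Module):
    """Decodes content features into mel frames, conditioned on speaker + prosody embeddings."""
    def __init__(self, content_dim=256, spk_dim=256, pros_dim=128, mel_dim=80, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(content_dim + spk_dim + pros_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, content, spk_emb, pros_emb):        # content: (B, T, content_dim)
        T = content.size(1)
        cond = torch.cat([spk_emb, pros_emb], dim=-1)      # (B, spk_dim + pros_dim)
        cond = cond.unsqueeze(1).expand(-1, T, -1)         # broadcast conditioning over time
        x, _ = self.rnn(torch.cat([content, cond], dim=-1))
        return self.out(x)                                 # (B, T, mel_dim)

# Separate encoders for pitch and volume keep the two prosody factors apart at the input level.
pitch_enc, vol_enc = ProsodyEncoder(), ProsodyEncoder()
decoder = ConditionedDecoder(pros_dim=2 * 64)

content = torch.randn(2, 100, 256)           # content features from some content encoder (assumed)
spk_emb = torch.randn(2, 256)                # target-speaker embedding (assumed)
f0, energy = torch.randn(2, 100, 1), torch.randn(2, 100, 1)
pros_emb = torch.cat([pitch_enc(f0), vol_enc(energy)], dim=-1)
mel = decoder(content, spk_emb, pros_emb)    # (2, 100, 80)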