We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC toolkit we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-lingual any-to-one (A2O), cross-lingual A2O, and any-to-any (A2A) VC, using the Voice Conversion Challenge 2020 (VCC2020) dataset. We investigate S3R-based VC from various aspects, including model type, multilinguality, and supervision. We also study the effect of a post-discretization process based on k-means clustering and show how it improves performance in the A2A setting. Finally, a comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and sheds light on possible directions for improvement.