Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speaker, whether seen or unseen during training. Various any-to-any VC approaches have been proposed, such as AUTOVC, AdaINVC, and FragmentVC. AUTOVC and AdaINVC use source and target encoders to disentangle the content and speaker information of the features. FragmentVC uses two encoders to encode source and target information and adopts cross attention to align source and target features with similar phonetic content. Moreover, pre-trained features are adopted: AUTOVC uses a d-vector to extract speaker information, and FragmentVC uses self-supervised learning (SSL) features such as wav2vec 2.0 to extract phonetic content information. Unlike previous work, we propose S2VC, which uses self-supervised features as both source and target features for the VC model. A supervised phoneme posteriorgram (PPG), which is believed to be speaker-independent and is widely used in VC to extract content information, is chosen as a strong baseline for SSL features. Both objective and subjective evaluations show that models taking the SSL feature CPC as both source and target features outperform those taking PPG as the source feature, suggesting that SSL features have great potential for improving VC.
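To make the cross-attention alignment mentioned above concrete, the following is a minimal sketch of scaled dot-product cross attention between source and target frame features. It is an illustration only, not the S2VC or FragmentVC implementation: the function name `cross_attention` and the toy two-dimensional vectors are hypothetical, and a real model would apply learned query, key, and value projections to SSL features (for example CPC) rather than use raw vectors.

```python
import math


def cross_attention(source, target):
    """Scaled dot-product cross attention (illustrative sketch).

    source: list of T_src feature vectors, used as queries.
    target: list of T_tgt feature vectors, used as keys and values.
    Returns (aligned, weights); aligned[i] is a weighted mix of target frames.
    """
    dim = len(source[0])
    aligned, weights = [], []
    for query in source:
        # Similarity of this source frame to every target frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in target
        ]
        # Softmax over target frames (shifted by the max for stability).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of target frames for this source frame.
        aligned.append(
            [sum(w * frame[j] for w, frame in zip(row, target)) for j in range(dim)]
        )
    return aligned, weights


# Hypothetical toy features: two source frames, three target frames.
source_feats = [[1.0, 0.0], [0.0, 1.0]]
target_feats = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
aligned, weights = cross_attention(source_feats, target_feats)
```

In this toy setup each source frame places most of its attention weight on the target frames pointing in the same direction, which loosely mirrors how frames with similar phonetic content would be matched.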