Voice conversion (VC) models modify timbre while preserving paralinguistic features, enabling applications such as dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users wish to convert to entirely novel, unseen voices. To address this, we introduce SpeakerVAE, a lightweight method for generating novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models and requires no co-training or fine-tuning of the base VC system. We evaluate our approach with state-of-the-art VC models, FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.
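To make the sampling pipeline concrete, below is a minimal sketch (not the authors' code) of the core idea: a small two-level hierarchical VAE whose decoder maps prior samples to a speaker embedding that a pretrained VC model could consume as its timbre condition. All class names, dimensions, the two-level depth, and the `vc_model.convert` call are illustrative assumptions, not the paper's actual architecture or API.

```python
import torch
import torch.nn as nn

class HierarchicalSpeakerVAE(nn.Module):
    """Toy two-level hierarchical VAE over speaker embeddings (illustrative)."""

    def __init__(self, spk_dim=192, z_top=32, z_bot=64, hidden=256):
        super().__init__()
        # Encoder path (used during training; unused at sampling time).
        self.enc_bot = nn.Sequential(nn.Linear(spk_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_bot))
        self.enc_top = nn.Sequential(nn.Linear(z_bot, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_top))
        # Top-down conditional prior p(z_bot | z_top) and decoder p(x | z_bot).
        self.prior_bot = nn.Sequential(nn.Linear(z_top, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 2 * z_bot))
        self.dec = nn.Sequential(nn.Linear(z_bot, hidden), nn.ReLU(),
                                 nn.Linear(hidden, spk_dim))
        self.z_top = z_top

    @torch.no_grad()
    def sample_speaker(self, n=1):
        """Draw novel speaker embeddings by ancestral sampling from the priors."""
        z_top = torch.randn(n, self.z_top)              # z_top ~ N(0, I)
        mu, logvar = self.prior_bot(z_top).chunk(2, -1)  # p(z_bot | z_top)
        z_bot = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.dec(z_bot)                           # novel speaker embedding

vae = HierarchicalSpeakerVAE()          # assume weights trained on speaker embeddings
spk_emb = vae.sample_speaker(n=4)       # shape: (4, 192)
# The sampled embedding would then condition the (frozen) base VC model,
# e.g. via a hypothetical interface:
# wav = vc_model.convert(source_wav, speaker_embedding=spk_emb[0])
```

The plug-in property claimed in the abstract corresponds to the fact that only the VAE is trained, on speaker representations extracted by the base system; the VC model itself stays frozen and simply receives a sampled embedding in place of one extracted from a real target utterance.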