Recent developments in neural speech synthesis and vocoding have sparked renewed interest in voice conversion (VC). Beyond timbre transfer, control over para-linguistic parameters such as pitch and speed is critical for deploying VC systems in many application scenarios. Existing studies, however, either provide only utterance-level global control or offer controls that lack interpretability. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying control of pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech by a vocoder. ControlVC achieves speed control through TD-PSOLA pre-processing of the source utterance, and pitch control by manipulating the pitch contour before feeding it to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two self-constructed baselines on speech quality, and that it successfully achieves time-varying pitch and speed control.
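To make the two controls concrete, the following is a minimal sketch of what time-varying control curves could look like at the frame level. It is not ControlVC's implementation: the function names `shift_f0` and `output_boundaries`, the semitone parameterization, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions made for illustration. The sketch only edits a pitch contour and computes a time map; it does not perform TD-PSOLA resynthesis or run any encoder.

```python
def shift_f0(f0_hz, semitone_curve):
    """Apply a per-frame pitch shift, given in semitones, to an F0 contour.

    Assumption: frames with f0 <= 0 are unvoiced and are returned as 0.0.
    """
    if len(f0_hz) != len(semitone_curve):
        raise ValueError("f0_hz and semitone_curve must have the same length")
    return [
        f0 * 2.0 ** (st / 12.0) if f0 > 0 else 0.0
        for f0, st in zip(f0_hz, semitone_curve)
    ]


def output_boundaries(speed_curve):
    """Compute output-time frame boundaries for a per-frame speed factor.

    A source frame with speed s occupies 1/s output frames, so s > 1
    shortens it and s < 1 stretches it. Returns len(speed_curve) + 1
    boundaries, measured in units of source frames.
    """
    boundaries = [0.0]
    for s in speed_curve:
        if s <= 0:
            raise ValueError("speed factors must be positive")
        boundaries.append(boundaries[-1] + 1.0 / s)
    return boundaries


if __name__ == "__main__":
    # Hypothetical 3-frame contour: voiced, unvoiced, voiced.
    f0 = [100.0, 0.0, 200.0]
    # Raise the first frame by an octave, lower the last by an octave.
    print(shift_f0(f0, [12.0, 12.0, -12.0]))
    # Normal speed for two frames, then double speed for two frames.
    print(output_boundaries([1.0, 1.0, 2.0, 2.0]))
```

In a system of the kind the abstract describes, the modified contour would be passed to the pitch encoder in place of the original, and a time map like the one above would drive the time-scale modification of the source utterance before encoding.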