Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises the ability to transform speech to match the vocal identity of any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end zero-shot voice conversion model that uses location-variable convolutions (LVCs) to jointly model the conversion and speech synthesis processes with a small number of parameters. LVC-VC utilizes carefully designed input features that have disentangled content and speaker style information, and the neural vocoder-like architecture learns to combine them to perform voice conversion while simultaneously synthesizing audio. Experiments show that our model achieves competitive or better voice conversion performance compared to several baselines while maintaining intelligibility particularly well.