Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises the ability to transform speech to match the voice style of any speaker. However, little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose Location-Variable Convolution-based Voice Conversion (LVC-VC), a model for performing end-to-end zero-shot voice conversion that is based on a neural vocoder. LVC-VC utilizes carefully designed input features in which content and speaker style information are disentangled, and its vocoder-like architecture learns to combine them, performing voice conversion and audio synthesis simultaneously. To the best of our knowledge, LVC-VC is one of the first models proposed for zero-shot voice conversion in an end-to-end manner, and it is the first to do so using a vocoder-like neural framework. Experiments show that our model achieves voice style transfer performance that is competitive with or better than several baselines, while preserving the intelligibility of the converted speech much more effectively.