Building a voice conversion system for noisy target speakers, such as users who provide noisy samples or speakers whose data is found on the Internet, is challenging because training on contaminated speech degrades conversion performance. In this paper, we build on our recently proposed Glow-WaveGAN and propose a noise-independent speech representation learning approach for high-quality voice conversion for noisy target speakers. Specifically, we learn a latent feature space in which the target distribution modeled by the conversion model is drawn exactly from the distribution modeled by the waveform generator. On this premise, we further make the latent features noise-invariant. To this end, we introduce a noise-controllable WaveGAN, whose encoder learns a noise-independent acoustic representation directly from the waveform and whose decoder performs noise control in the hidden space through a FiLM module. For the conversion model, we use a flow-based model to learn the distribution of noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity in voice conversion for noisy target speakers.
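To illustrate the noise-control step, the following is a minimal NumPy sketch of generic feature-wise linear modulation (FiLM), in which a condition vector produces a per-channel scale and shift applied to hidden features. It is not the paper's implementation: the abstract does not specify how the noise condition is encoded or where the module sits in the decoder, so the class name `FiLM`, the two-dimensional clean/noisy condition, the linear projections, and all shapes are assumptions made for illustration.

```python
import numpy as np


class FiLM:
    """Generic FiLM layer: out = gamma(cond) * hidden + beta(cond)."""

    def __init__(self, cond_dim, num_channels, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical linear projections from the condition vector
        # to a per-channel scale (gamma) and shift (beta).
        self.w_gamma = rng.normal(0.0, 0.1, size=(cond_dim, num_channels))
        self.b_gamma = np.ones(num_channels)   # scale bias of 1: identity by default
        self.w_beta = rng.normal(0.0, 0.1, size=(cond_dim, num_channels))
        self.b_beta = np.zeros(num_channels)   # shift bias of 0: identity by default

    def __call__(self, hidden, cond):
        # hidden: (channels, frames); cond: (cond_dim,)
        gamma = cond @ self.w_gamma + self.b_gamma
        beta = cond @ self.w_beta + self.b_beta
        # Broadcast the per-channel scale and shift over the time axis.
        return gamma[:, None] * hidden + beta[:, None]


# Assumed setup: 4 hidden channels, 10 frames, a 2-dim clean/noisy condition.
film = FiLM(cond_dim=2, num_channels=4)
hidden = np.random.default_rng(1).normal(size=(4, 10))

clean = film(hidden, np.array([1.0, 0.0]))    # assumed "clean" label
noisy = film(hidden, np.array([0.0, 1.0]))    # assumed "noisy" label
neutral = film(hidden, np.zeros(2))           # zero condition leaves features unchanged
```

In this sketch the same hidden features yield different decoder inputs depending only on the condition, which is the sense in which noise can be controlled in the hidden space while the encoder output stays noise-independent.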