Conventional voice conversion (VC) converts the speaker information without altering the linguistic content. In some real-world scenarios, however, such as VC for movies/videos and VC for music, the voice is entangled with background sounds that are informative and need to be retained. To this end, we have developed a noisy-to-noisy (N2N) VC framework that converts the speaker's identity while preserving the background sounds. Although this framework, which consists of a denoising module and a VC module, handles the background sounds well, the VC module is sensitive to the distortion introduced by the denoising module. To address this distortion issue, in this paper we propose an improved VC module that directly models the noisy speech waveform while controlling the background sounds. Experimental results demonstrate that the improved framework significantly outperforms the previous one and achieves an acceptable score in terms of naturalness, while reaching similarity performance comparable to the upper bound of our framework.