In a conventional voice conversion (VC) framework, a VC model is typically trained on a clean dataset of speech carefully recorded and selected to minimize background interference. However, collecting such a high-quality dataset is expensive and time-consuming, whereas leveraging crowd-sourced speech data for training is far more economical. Moreover, in some real-world VC scenarios, such as VC for video and VC-based data augmentation for speech recognition systems, the background sounds themselves are informative and need to be preserved. In this paper, to explore VC with the flexibility to handle background sounds, we propose a noisy-to-noisy (N2N) VC framework composed of a denoising module and a VC module. With the proposed framework, we can convert the speaker's identity while preserving the background sounds. Both objective and subjective evaluations are conducted, and the results demonstrate the effectiveness of the proposed framework.