Background sound is an informative form of art that helps provide a more immersive experience in real-world voice conversion (VC) applications. However, prior VC research, which focuses mainly on clean voices, pays little attention to VC with background sound. The critical problems in preserving background sound during VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework based on multi-task learning that sequentially cascades a source separation (SS) module, a bottleneck feature extraction module, and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task, and the unified task share a uniform reconstruction loss constrained by joint training, which reduces the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained on clean data.
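The joint training described above can be summarized as a weighted combination of the three task losses. The sketch below is a minimal, hypothetical illustration of that idea; the weight values, loss names, and the simple weighted-sum form are assumptions for illustration, not the paper's exact objective.

```python
def joint_loss(loss_ss, loss_vc, loss_rec, w_ss=1.0, w_vc=1.0, w_rec=1.0):
    """Combine the source separation (SS) loss, the VC loss, and the shared
    reconstruction loss into one objective for end-to-end joint training.

    The weights w_ss, w_vc, and w_rec are hypothetical hyperparameters;
    the actual balancing scheme would be chosen empirically.
    """
    return w_ss * loss_ss + w_vc * loss_vc + w_rec * loss_rec


# Illustrative usage with made-up per-task loss values:
total = joint_loss(loss_ss=0.4, loss_vc=0.6, loss_rec=0.2)
```

Because all three tasks are optimized through one objective, gradients from the shared reconstruction loss flow into both the SS and VC modules, which is what reduces the cascade mismatch between them.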