Numerous voice conversion (VC) techniques have been proposed for converting voices between different speakers. Although converted speech of decent quality can be obtained when VC is applied in a clean environment, the quality drops sharply when the system runs under noisy conditions. To address this issue, we propose a novel enhancement-based StarGAN (E-StarGAN) VC system, which leverages a speech enhancement (SE) technique for signal pre-processing. SE systems are generally used to reduce the noise components in noisy speech and to generate enhanced speech for downstream tasks. We therefore investigated the effectiveness of E-StarGAN, which combines VC and SE, and demonstrated the robustness of the proposed approach in various noisy environments. The results of VC experiments conducted on a Mandarin dataset show that, when combined with SE, the proposed E-StarGAN VC model is robust to unseen noises. In addition, subjective listening test results show that the proposed E-StarGAN model can improve the sound quality of speech signals converted from noise-corrupted source utterances.