Voice conversion has made great progress in recent years in terms of speech quality and speaker similarity under studio-quality test conditions. In real applications, however, test speech from the source or target speaker can be corrupted by various environmental noises, which seriously degrade speech quality and speaker similarity. In this paper, we propose a novel encoder-decoder based noise-robust voice conversion framework, which consists of a speaker encoder, a content encoder, a decoder, and two domain adversarial neural networks. Specifically, we combine the technique of disentangling speaker and content representations with domain adversarial training. Domain adversarial training forces the speaker and content representations that the speaker encoder and content encoder extract from clean and noisy speech to lie in the same space, so the learned representations are noise-invariant. These two noise-invariant representations can therefore be fed to the decoder to predict the clean converted spectrum. Experimental results demonstrate that the proposed method can synthesize clean converted speech under noisy test scenarios, where the source and target speech may be corrupted by noise types either seen or unseen during training. Additionally, both speech quality and speaker similarity are improved.
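To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the framework: a speaker encoder, a content encoder, a decoder, and two domain-adversarial branches that use a gradient reversal layer (GRL) to push representations of clean and noisy speech into a shared space. All layer sizes, module names, and the mel-spectrogram input shape are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the noise-robust VC framework; sizes are assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None


def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)


class NoiseRobustVC(nn.Module):
    def __init__(self, n_mels=80, d_spk=64, d_con=64):
        super().__init__()
        self.spk_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                     nn.Linear(128, d_spk))
        self.con_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                     nn.Linear(128, d_con))
        self.decoder = nn.Sequential(nn.Linear(d_spk + d_con, 128), nn.ReLU(),
                                     nn.Linear(128, n_mels))
        # Two domain classifiers predict clean vs. noisy; the GRL trains the
        # encoders to remove that cue, yielding noise-invariant codes.
        self.spk_domain = nn.Linear(d_spk, 2)
        self.con_domain = nn.Linear(d_con, 2)

    def forward(self, mel):                       # mel: (B, T, n_mels)
        s = self.spk_enc(mel).mean(dim=1)         # utterance-level speaker code
        c = self.con_enc(mel)                     # frame-level content codes
        s_exp = s.unsqueeze(1).expand(-1, c.size(1), -1)
        recon = self.decoder(torch.cat([s_exp, c], dim=-1))
        dom_s = self.spk_domain(grad_reverse(s))  # clean/noisy logits (speaker)
        dom_c = self.con_domain(grad_reverse(c))  # clean/noisy logits (content)
        return recon, dom_s, dom_c


model = NoiseRobustVC()
mel = torch.randn(4, 50, 80)                      # batch of 4, 50 frames
recon, dom_s, dom_c = model(mel)
```

In training, `recon` would be driven toward the clean target spectrum while `dom_s` and `dom_c` are trained with a clean/noisy domain label; the gradient reversal makes the encoders adversarial to the domain classifiers, which is the core of the noise-invariance argument in the abstract.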