Any-to-any voice conversion converts the vocal timbre of an utterance to that of any target speaker, including speakers unseen during training. Although several state-of-the-art any-to-any voice conversion models exist, they all rely on clean utterances to convert successfully. In real-world scenarios, however, clean utterances of a speaker are difficult to collect, and recordings are usually degraded by noise or reverberation. It is therefore highly desirable to understand how these degradations affect voice conversion and to build degradation-robust models. In this paper, we report the first comprehensive study of the degradation robustness of any-to-any voice conversion. We show that the performance of current state-of-the-art models is severely hampered when they are given degraded utterances. To improve robustness, we propose speech enhancement concatenation and denoising training. In addition to common degradations, we also consider adversarial noise, which alters the model output significantly yet is imperceptible to humans. Our results show that both concatenation with off-the-shelf speech enhancement models and denoising training of voice conversion models improve robustness, though each approach has its own pros and cons.