In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). First, we build an any-to-many voice conversion (VC) system that converts source speech with arbitrary language content into fake speech in the target speaker's voice. The converted speech produced by the VC system is then post-processed in the time domain to improve its ability to deceive detectors. Experimental results show that our system is adversarially effective against anti-spoofing detectors, with only a small compromise in audio quality and speaker similarity. The system ranks top in Track 3.1 of ADD 2022, showing that our method also generalizes well across different detectors.