In recent years, synthetic speech generated by advanced text-to-speech (TTS) and voice conversion (VC) systems has become a serious threat to automatic speaker verification (ASV) systems, which motivates the design of synthetic speech detection systems to protect them. In this paper, we propose a new speech anti-spoofing model named ResWavegram-Resnet (RW-Resnet). The model consists of two parts: stacked Conv1D Resblocks and a Resnet34 backbone. A Conv1D Resblock is a Conv1D block with a residual connection. In the first part, the raw waveform is fed to the stacked Conv1D Resblocks to produce a learned representation, the ResWavegram. Compared with traditional hand-crafted features, the ResWavegram retains all the information in the audio signal and has a stronger feature-extraction ability. In the second part, the extracted features are fed to the Resnet34 backbone, which decides whether the utterance is spoofed or bonafide. We evaluate the proposed RW-Resnet on the ASVspoof2019 logical access (LA) corpus. Experimental results show that RW-Resnet outperforms other state-of-the-art anti-spoofing models, which demonstrates its effectiveness in detecting synthetic speech attacks.
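To make the idea of a Conv1D block with a residual connection concrete, the following is a minimal single-channel sketch in plain Python. It is an illustration under stated assumptions, not the RW-Resnet implementation: the abstract does not give the number of layers, kernel sizes, channel counts, strides, or where the activations sit, so the two-convolution layout `y = x + conv(relu(conv(x)))`, the odd kernel lengths, and the function names `conv1d_same`, `conv1d_resblock`, and `reswavegram` are all hypothetical. A real front end would be multi-channel with learned weights, and its output would be passed to the Resnet34 backbone.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding so the output length equals len(x).

    Assumes an odd kernel length (an assumption of this sketch).
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def conv1d_resblock(x, kernel_a, kernel_b):
    """Toy residual block: y = x + conv_b(relu(conv_a(x))).

    The identity skip path carries the input through unchanged, so the
    block only has to model a correction on top of the raw signal.
    """
    h = relu(conv1d_same(x, kernel_a))
    h = conv1d_same(h, kernel_b)
    return [xi + hi for xi, hi in zip(x, h)]


def reswavegram(waveform, blocks):
    """Pass a raw waveform through a stack of toy residual blocks.

    `blocks` is a list of (kernel_a, kernel_b) pairs, one pair per block.
    """
    out = list(waveform)
    for kernel_a, kernel_b in blocks:
        out = conv1d_resblock(out, kernel_a, kernel_b)
    return out


# Example: identity kernels, so the residual branch contributes relu(x).
identity = [0.0, 1.0, 0.0]
print(conv1d_resblock([1.0, -2.0, 3.0], identity, identity))  # [2.0, -2.0, 6.0]
```

With all-zero kernels the residual branch vanishes and each block returns its input unchanged. This is the property that lets the raw waveform pass through a deep stack without being lost.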