Automatic speaker verification (ASV) is widely used in real life for identity authentication. However, with the rapid development of voice conversion and speech synthesis algorithms, ASV systems are vulnerable to spoofing attacks. In recent years, there has been considerable work on synthetic speech detection, and researchers have proposed a number of anti-spoofing methods based on hand-crafted features to improve the detection accuracy and robustness of ASV systems. However, hand-crafted features discard information in the raw waveform that is useful for anti-spoofing, which degrades detection performance. Inspired by the promising performance of ConvNeXt on image classification tasks, we revise the ConvNeXt architecture for the spoofing attack detection task and propose a lightweight end-to-end anti-spoofing model. By integrating the revised architecture with a channel attention block and training with the focal loss, the proposed model focuses on the most informative sub-bands of the speech representation and on the samples that are hard to classify. Experiments show that our best single system achieves an equal error rate of 0.75% and a min-tDCF of 0.0212 on the ASVspoof 2019 LA evaluation set, outperforming state-of-the-art systems.
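The abstract names two concrete ingredients, a channel attention block and the focal loss. The following is a minimal sketch of how these two components are commonly implemented in PyTorch, assuming a squeeze-and-excitation style attention gate and standard focal-loss hyperparameters (alpha, gamma, and the reduction ratio are illustrative assumptions, not values taken from the paper, and the code is not the authors' released implementation).

```python
# Hedged sketch: focal loss plus a squeeze-and-excitation style channel
# attention gate, as typically paired with a ConvNeXt-like backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Focal loss: down-weights easy examples so training focuses on hard ones."""

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Per-sample cross-entropy, then modulate by (1 - p_t)^gamma.
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)  # probability assigned to the true class
        return (self.alpha * (1.0 - p_t) ** self.gamma * ce).mean()


class ChannelAttention(nn.Module):
    """SE-style gate that re-weights feature channels (sub-bands)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature map from the backbone.
        w = self.fc(x.mean(dim=-1))   # squeeze over time, excite per channel
        return x * w.unsqueeze(-1)    # emphasize informative channels


if __name__ == "__main__":
    feats = torch.randn(4, 64, 100)                       # toy (batch, channel, time) features
    attended = ChannelAttention(64)(feats)
    logits = attended.mean(dim=-1) @ torch.randn(64, 2)    # dummy 2-class head
    print(FocalLoss()(logits, torch.tensor([0, 1, 1, 0])))
```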