Automatic speaker verification is susceptible to various manipulations and spoofing attacks, such as text-to-speech (TTS) synthesis, voice conversion (VC), replay, tampering, and so on. In this paper, we consider a new spoofing scenario called "Partial Spoof" (PS) in which synthesized or transformed audio segments are embedded into a bona fide speech utterance. While existing countermeasures (CMs) can detect fully spoofed utterances, they need to be adapted or extended to the PS scenario to detect utterances in which only a part of the audio signal is generated and hence only a fraction of the utterance is spoofed. For improved explainability, such new CMs should ideally also be able to locate these short spoofed segments. Our previous study introduced the first version of a speech database suitable for training CMs for the PS scenario and showed that, although it is possible to train CMs to perform the two types of detection described above, there is much room for improvement. In this paper, we propose various improvements to construct a significantly more accurate CM that can detect short generated spoofed audio segments at finer temporal resolutions. First, we introduce newly proposed self-supervised pre-trained models as enhanced feature extractors. Second, we extend the PartialSpoof database by adding segment labels for various temporal resolutions, ranging from 20 ms to 640 ms. Third, we propose a new CM and training strategies that enable the simultaneous use of the utterance-level and segment-level labels at different temporal resolutions. We also show that the proposed CM is capable of detecting spoofing at the utterance level with low error rates, not only in the PS scenario but also in a related logical access (LA) scenario. The equal error rates of utterance-level detection on the PartialSpoof and the ASVspoof 2019 LA databases were 0.47% and 0.59%, respectively.
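As a concrete illustration of how segment labels at multiple temporal resolutions (20 ms to 640 ms) relate to each other and to the utterance-level label, the following Python sketch derives coarser-resolution labels from finer-resolution ones. It assumes a simple pooling rule, namely that a coarse segment is marked spoofed whenever any finer segment it covers is spoofed, and that the utterance label is spoofed if any segment is; both the pooling rule and the example label sequence are illustrative assumptions, not specifications taken from the paper or the PartialSpoof database.

```python
import numpy as np

def coarsen_labels(fine_labels: np.ndarray, factor: int) -> np.ndarray:
    """Derive coarser segment labels by pooling groups of finer segments.

    A coarse segment is labeled spoofed (1) if any of the finer segments
    it covers is spoofed; otherwise it remains bona fide (0).
    """
    # Pad so the number of fine segments is a multiple of the pooling factor.
    pad = (-len(fine_labels)) % factor
    padded = np.concatenate([fine_labels, np.zeros(pad, dtype=fine_labels.dtype)])
    return padded.reshape(-1, factor).max(axis=1)

# Hypothetical 20 ms labels for a 200 ms stretch of audio (1 = spoofed segment).
labels_20ms = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0])

labels_40ms = coarsen_labels(labels_20ms, 2)    # 40 ms resolution
labels_160ms = coarsen_labels(labels_20ms, 8)   # 160 ms resolution
utterance_label = int(labels_20ms.max())        # spoofed if any segment is spoofed

print(labels_40ms)       # [0 1 0 0 1]
print(labels_160ms)      # [1 1]
print(utterance_label)   # 1
```

Under this assumed rule, labels at every resolution, together with the utterance-level label, can be generated from a single fine-grained annotation, which is the kind of multi-resolution supervision the proposed CM and training strategies are designed to exploit jointly.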