The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

Automatic speaker verification is susceptible to various manipulations and spoofing attacks, such as text-to-speech synthesis, voice conversion, replay, tampering, and adversarial attacks. We consider a new spoofing scenario called "Partial Spoof" (PS), in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully spoofed utterances, they need to be adapted or extended to the PS scenario. We propose several improvements to construct a significantly more accurate CM that can detect and locate short generated spoofed speech segments at finer temporal resolutions. First, we introduce newly developed self-supervised pre-trained models as enhanced feature extractors. Second, we extend our PartialSpoof database by adding segment labels for various temporal resolutions. Since the short spoofed speech segments embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as long as 640 ms. Third, we propose a new CM that uses the segment-level labels at different temporal resolutions together with utterance-level labels to perform utterance- and segment-level detection simultaneously. We also show that the proposed CM detects spoofing at the utterance level with low error rates in the PS scenario as well as in a related logical access (LA) scenario. The equal error rates of utterance-level detection on the PartialSpoof database and the ASVspoof 2019 LA database were 0.77% and 0.90%, respectively.
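
To make the multi-resolution labeling idea concrete, the following is a minimal sketch of how segment labels at coarser resolutions could be derived from labels at the finest resolution. It is not the authors' implementation. Several details are assumptions made for illustration: the six resolutions are taken to double from 20 ms to 640 ms (the abstract only gives the count and the two endpoints), `1` is used for spoofed and `0` for bona fide, and a coarse segment is marked spoofed if any finer frame inside it is spoofed. The function names are hypothetical.

```python
from typing import Dict, List

BASE_MS = 20  # finest resolution named in the abstract
# Assumption: the six resolutions double at each step between 20 ms and 640 ms.
RESOLUTIONS_MS = [20, 40, 80, 160, 320, 640]


def coarsen_labels(frame_labels: List[int], factor: int) -> List[int]:
    """Merge every `factor` consecutive labels (1 = spoof, 0 = bona fide).

    Assumed rule: a coarse segment is spoofed if any frame in it is spoofed.
    A trailing partial group is kept and labeled by the same rule.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [
        int(any(frame_labels[i:i + factor]))
        for i in range(0, len(frame_labels), factor)
    ]


def multi_resolution_labels(frame_labels: List[int]) -> Dict[int, List[int]]:
    """Return one label sequence per resolution, keyed by resolution in ms."""
    return {
        res: coarsen_labels(frame_labels, res // BASE_MS)
        for res in RESOLUTIONS_MS
    }


def utterance_label(frame_labels: List[int]) -> int:
    """Utterance-level label: spoofed if any frame is spoofed (assumed rule)."""
    return int(any(frame_labels))


# Hypothetical 1.28 s utterance (64 frames of 20 ms) with a 60 ms spoofed
# segment covering frames 10 to 12.
example = [0] * 64
example[10:13] = [1, 1, 1]
labels = multi_resolution_labels(example)
```

In this sketch the 60 ms spoofed segment occupies three labels at 20 ms but a single label at 160 ms and above, which illustrates why a short embedded segment is harder to locate precisely at coarse resolutions, while the utterance-level label stays the same.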
