The past few years have seen significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be exploited by attackers in the wild for illegal purposes. The ASVspoof challenge focuses mainly on audio synthesized by advanced speech synthesis and voice conversion models, and on replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios to cover more aspects. ADD 2022 is also the first challenge to propose a partially fake audio detection task. Such brand new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework that introduces a question-answering (fake span discovery) strategy with a self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within the partially fake audio, directs the model's attention toward discovering the fake spans rather than toward shortcuts that generalize less well, and finally equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022.
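The span-prediction idea can be illustrated with a minimal sketch in the style of extractive question answering. It assumes the model emits one start logit and one end logit per audio frame; the abstract does not specify the actual architecture, loss, or decoding rule, so the function names (`span_loss`, `best_span`) and the scoring scheme below are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array of logits."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

def span_loss(start_logits, end_logits, start_idx, end_idx):
    """Assumed training objective: mean cross-entropy of the true start
    frame and the true end frame of the fake clip."""
    start_lp = log_softmax(start_logits)[start_idx]
    end_lp = log_softmax(end_logits)[end_idx]
    return -0.5 * (start_lp + end_lp)

def best_span(start_logits, end_logits):
    """Assumed decoding rule: pick (start, end) with start <= end that
    maximizes start_logits[start] + end_logits[end]."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    best, best_score = (0, 0), -np.inf
    best_start = 0  # index of the largest start logit seen so far
    for end in range(len(end_logits)):
        if start_logits[end] > start_logits[best_start]:
            best_start = end
        score = start_logits[best_start] + end_logits[end]
        if score > best_score:
            best_score, best = score, (best_start, end)
    return best

# Toy logits for a six-frame utterance (made-up values for illustration).
start_logits = [0.1, 2.0, 0.3, 0.0, -1.0, 0.2]
end_logits = [0.0, 0.1, 0.2, 1.5, 0.3, -0.5]
print(best_span(start_logits, end_logits))
```

Supervising frame positions in this way gives the model a training signal tied to where the fake clip actually lies, which is the intuition the abstract describes for steering it away from less generalizable shortcuts.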