The rapid spread of media content synthesis technology and the potentially damaging impact of audio and video deepfakes on people's lives have raised the need for systems able to detect these forgeries automatically. In this work, we present a novel approach to synthetic speech detection that exploits the combination of two high-level semantic properties of the human voice. On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted with a state-of-the-art method for the automatic speaker verification task. On the other side, voice prosody, understood as variations in rhythm, pitch, and accent in speech, is extracted through a specialized encoder. We show that the combination of these two embeddings, fed to a supervised binary classifier, allows the detection of deepfake speech generated with both Text-to-Speech and Voice Conversion techniques. Our results show improvements over the considered baselines, good generalization across multiple datasets, and robustness to audio compression.
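As a rough illustration of the fusion strategy described above, the sketch below concatenates a speaker embedding and a prosody embedding and feeds them to a small binary classifier. All dimensions, layer sizes, and names are placeholder assumptions for exposition; the paper's actual encoders and classifier architecture may differ.

```python
import torch
import torch.nn as nn


class DeepfakeSpeechClassifier(nn.Module):
    """Illustrative fusion of speaker-identity and prosody embeddings.

    Embedding dimensions and hidden sizes are assumptions,
    not the values used in the paper.
    """

    def __init__(self, spk_dim: int = 192, pros_dim: int = 128, hidden: int = 128):
        super().__init__()
        # Binary classifier over the concatenated embeddings.
        self.net = nn.Sequential(
            nn.Linear(spk_dim + pros_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: bona fide vs. synthetic
        )

    def forward(self, spk_emb: torch.Tensor, pros_emb: torch.Tensor) -> torch.Tensor:
        # spk_emb: (batch, spk_dim) from a pretrained speaker-verification model
        # pros_emb: (batch, pros_dim) from a prosody encoder
        fused = torch.cat([spk_emb, pros_emb], dim=-1)
        return self.net(fused)


# Random tensors stand in for embeddings extracted from real audio.
clf = DeepfakeSpeechClassifier()
logits = clf(torch.randn(4, 192), torch.randn(4, 128))
probs = torch.sigmoid(logits)  # probability that each utterance is synthetic
```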