In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces. Our work is inspired by prior work on forensic similarity in the image domain, which demonstrated strong generalization to unknown forensic traces without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor, built on a speech deepfake detector backbone, and a shallow neural network referred to as the similarity network. This system maps pairs of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. We also assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.
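The two-part architecture described above can be sketched as follows. This is a minimal, illustrative NumPy mock-up, not the paper's implementation: the real feature extractor is a trained speech deepfake detector backbone, whereas here it is replaced by a fixed random projection, and the similarity network weights are random rather than trained. All names, dimensions, and shapes are assumptions for illustration only.

```python
import numpy as np

# Hypothetical dimensions (not from the paper).
EMB_DIM = 128   # assumed embedding size of the feature extractor
HID_DIM = 64    # assumed hidden width of the shallow similarity network

rng = np.random.default_rng(0)


def extract_features(segment: np.ndarray) -> np.ndarray:
    """Stand-in for the deepfake-detector backbone: a fixed random
    projection of the raw segment to an EMB_DIM-dimensional embedding."""
    proj = np.random.default_rng(42).standard_normal((segment.size, EMB_DIM))
    return segment @ proj / np.sqrt(segment.size)


# Shallow similarity network: one hidden layer over the concatenated pair
# of embeddings, ending in a sigmoid so the output is a score in (0, 1).
W1 = rng.standard_normal((2 * EMB_DIM, HID_DIM)) * 0.1
W2 = rng.standard_normal((HID_DIM, 1)) * 0.1


def similarity_score(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    """Map a pair of audio segments to a score; a high score would
    indicate the same forensic traces, a low score different ones."""
    pair = np.concatenate([extract_features(seg_a), extract_features(seg_b)])
    hidden = np.maximum(pair @ W1, 0.0)  # ReLU
    return (1.0 / (1.0 + np.exp(-(hidden @ W2)))).item()


# Two mock 1-second segments at an assumed 16 kHz sample rate.
seg_a = rng.standard_normal(16_000)
seg_b = rng.standard_normal(16_000)
score = similarity_score(seg_a, seg_b)
```

In the paper's setting, the same (shared) feature extractor processes both segments, and only the comparison logic lives in the shallow similarity network; this sketch mirrors that structure, with the pair of embeddings concatenated before scoring.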