The malicious use of deep speech synthesis models may pose a significant threat to society. Many studies have therefore emerged to detect so-called "deepfake audio". However, these studies focus on binary detection, that is, distinguishing real audio from fake audio. In some realistic application scenarios, it is also necessary to know which tool or model generated the deepfake audio. This raises a question: can we recognize the system fingerprints of deepfake audio? In this paper, we propose a deepfake audio dataset for system fingerprint recognition (SFR) and conduct an initial investigation. The dataset was collected from five speech synthesis systems built on the latest state-of-the-art deep learning technologies, and it includes both clean and compressed sets. To facilitate the further development of system fingerprint recognition methods, we also provide benchmarks for comparison, along with our research findings. The dataset will be made publicly available.