State-of-the-art audio generation methods suffer from fingerprint artifacts and repeated inconsistencies across the temporal and spectral domains. Such artifacts can be captured well by frequency-domain analysis of the spectrogram. We therefore propose a novel long-range spectro-temporal modulation feature -- the 2D DCT of the log-Mel spectrogram -- for audio deepfake detection. We show that this feature captures such artifacts better than the log-Mel spectrogram, CQCC, and MFCC. Alongside this new feature, we employ spectrum augmentation and feature normalization to reduce overfitting and bridge the gap between the training and test datasets. We developed a CNN-based baseline that achieves a t-DCF of 0.0849, outperforming the best single system reported in the ASVspoof 2019 challenge. Combining our baseline with the proposed 2D DCT spectro-temporal feature lowers the t-DCF by 14% to 0.0737, making it one of the best systems for spoofing detection. Furthermore, we evaluate our model on two external datasets, demonstrating the proposed feature's generalization ability, and provide analysis and ablation studies of the proposed feature and results.
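The feature described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' exact pipeline: it assumes a precomputed log-Mel spectrogram (a random array stands in here; in practice it would come from an audio front end such as `librosa.feature.melspectrogram`), applies a type-II 2D DCT over the full mel-frequency/time plane to capture long-range spectro-temporal modulations, and then applies a simple zero-mean, unit-variance normalization of the kind the abstract mentions.

```python
import numpy as np
from scipy.fft import dctn

# Stand-in for a precomputed log-Mel spectrogram (mel bins x time frames).
# A real system would compute this from the waveform, e.g. with librosa.
rng = np.random.default_rng(0)
log_mel = np.log(rng.random((80, 400)) + 1e-6)

# 2D DCT (type-II, orthonormal) over the whole log-Mel spectrogram:
# low-index coefficients summarize long-range modulation patterns
# across both the spectral and temporal axes.
feat = dctn(log_mel, type=2, norm="ortho")

# Feature normalization (zero mean, unit variance) to reduce the
# mismatch between training and test conditions.
feat = (feat - feat.mean()) / (feat.std() + 1e-8)

print(feat.shape)  # same shape as the input log-Mel spectrogram
```

The 2D DCT keeps the feature map the same size as the input, so it can be fed to a CNN exactly like a spectrogram; truncating to the low-order coefficients is a common variant when a compact representation is preferred.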