Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various techniques for detecting audio spoofs, it is often unclear exactly why these architectures are successful: preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental? In this work, we address this problem: we systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which yields an average relative improvement of 37% in equal error rate (EER), all other factors held constant. Additionally, we evaluate generalization capabilities: we collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (performance degradation of up to 1000%). This may suggest that the community has tailored its solutions too closely to the prevailing ASVSpoof benchmark, and that deepfakes are much harder to detect outside the lab than previously thought.
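To make the feature comparison concrete, the sketch below computes the three front-ends named above (melspec, logspec, cqtspec) with librosa. This is an illustrative reconstruction, not the paper's exact preprocessing pipeline; the sample rate, FFT size, hop length, and bin counts are assumed values chosen for the example.

    import numpy as np
    import librosa

    def extract_features(path, sr=16000):
        # Assumed parameters (sr, n_fft, hop_length, n_mels, n_bins) are
        # illustrative; the paper's exact settings may differ.
        y, sr = librosa.load(path, sr=sr)
        hop = 512  # librosa's CQT needs hop divisible by 2**(n_octaves - 1)

        # melspec: mel-filterbank power spectrogram, converted to dB
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                             hop_length=hop, n_mels=80)
        melspec = librosa.power_to_db(mel, ref=np.max)

        # logspec: log-magnitude of the plain linear-frequency STFT
        stft = librosa.stft(y, n_fft=1024, hop_length=hop)
        logspec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

        # cqtspec: log-magnitude constant-Q transform
        # (geometrically spaced frequency bins, 12 per octave by default)
        cqt = librosa.cqt(y, sr=sr, hop_length=hop, n_bins=84)
        cqtspec = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)

        return {"melspec": melspec, "logspec": logspec, "cqtspec": cqtspec}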
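The results above are reported as equal error rate (EER): the threshold-free operating point where the false-acceptance and false-rejection rates coincide, so a relative improvement of 37% means, for example, an EER dropping from 10% to 6.3%. A minimal sketch of the standard computation follows, assuming the common scoring convention that higher scores indicate bona fide speech; this is a generic implementation, not code from the paper.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        # labels: 1 = bona fide, 0 = spoof; scores: higher = more bona fide
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        i = np.nanargmin(np.abs(fnr - fpr))  # point where FAR ~= FRR
        return (fpr[i] + fnr[i]) / 2.0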