A Data-Centric Approach to Generalizable Speech Deepfake Detection

Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While prior research has largely focused on model-centric and algorithm-centric solutions, the impact of data composition remains underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. For the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. For the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set covering various commercial APIs.
