Recent synthetic speech detection models typically adapt a pre-trained self-supervised learning (SSL) model via fine-tuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative; however, existing PEFT methods lack the inductive biases needed to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, enabling the simultaneous learning of discriminative features across multiple temporal resolutions and capturing both short-term artifacts and long-term distortions. With only $3.17$M trainable parameters ($1\%$ of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.
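To make the adapter idea concrete, the following is a minimal PyTorch sketch of a bottleneck adapter with parallel 1-D convolutions at several kernel sizes, of the kind the abstract describes. The kernel sizes, bottleneck width, use of depthwise convolutions, and residual placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvAdapter(nn.Module):
    """Hypothetical multi-scale convolutional adapter inserted inside an SSL encoder layer.

    Parallel depthwise 1-D convolutions with different kernel sizes capture
    artifacts at different temporal resolutions; a down/up projection keeps
    the trainable parameter count small (assumed values, for illustration only).
    """
    def __init__(self, d_model=1024, bottleneck=64, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)          # project to bottleneck dim
        self.branches = nn.ModuleList([
            nn.Conv1d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in kernel_sizes                            # one branch per temporal scale
        ])
        self.up = nn.Linear(bottleneck, d_model)             # project back to model dim
        self.act = nn.GELU()

    def forward(self, x):                                    # x: (batch, time, d_model)
        h = self.act(self.down(x)).transpose(1, 2)           # (batch, bottleneck, time)
        h = sum(branch(h) for branch in self.branches) / len(self.branches)
        h = self.up(self.act(h.transpose(1, 2)))             # back to (batch, time, d_model)
        return x + h                                         # residual around the adapter
```

With the assumed sizes above, such a module adds on the order of a few hundred thousand parameters per encoder layer, which is consistent with a total trainable budget of a few million parameters across a large SSL backbone.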