Existing approaches to replay and synthetic speech detection still generalize poorly to unseen spoofing attacks. This work proposes leveraging a novel model architecture, Res2Net, to improve the generalizability of anti-spoofing countermeasures. Res2Net modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and adds residual-like connections across the groups. These connections increase the range of possible receptive fields, resulting in multiple feature scales. This multi-scale mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also reduces the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both the physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both the PA and LA scenarios of the ASVspoof 2019 corpus.
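
To make the multi-scale mechanism concrete, the following minimal sketch counts the receptive-field widths that reach each channel group's output. It assumes the wiring of the original Res2Net formulation (the first group passes through unchanged, and every later group applies one 3x3 convolution to its own split plus the previous group's output); the abstract does not spell out these details, and the function name and parameters are hypothetical, chosen for illustration only.

```python
def res2net_receptive_fields(scale=4, kernel_size=3):
    """Return, for each channel group, the set of receptive-field widths
    (relative to the block input) that contribute to that group's output.

    Assumed wiring (original Res2Net formulation, not confirmed by the abstract):
      y_1 = x_1
      y_2 = K_2(x_2)
      y_i = K_i(x_i + y_{i-1})   for i > 2
    where each K_i is a convolution of width `kernel_size`.
    """
    growth = kernel_size - 1   # each convolution widens the receptive field by this much
    fields = [{1}]             # group 1 is passed through untouched
    previous = set()           # group 2 receives no input from group 1
    for _ in range(2, scale + 1):
        # The convolution sees this group's own split (width 1)
        # plus whatever the previous group's output already covered.
        conv_inputs = {1} | previous
        output = {width + growth for width in conv_inputs}
        fields.append(output)
        previous = output
    return fields


if __name__ == "__main__":
    for index, widths in enumerate(res2net_receptive_fields(), start=1):
        print(f"group {index}: receptive fields {sorted(widths)}")
```

Under these assumptions, later groups accumulate progressively more receptive-field widths, which is one way to read the claim that the cross-group connections yield multiple feature scales within a single block.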