Artefacts that distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific sub-bands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across the full spectrum of diverse spoofing attacks. Reliable detection therefore often depends upon the fusion of multiple detection systems, each tuned to detect a different form of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationships between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% on the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open-source implementation.
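The graph attention mechanism at the heart of the model scores each pair of connected nodes, normalises the scores over neighbours with a softmax, and aggregates projected node features with the resulting weights. The following is a minimal NumPy sketch of one generic graph attention layer; all names, dimensions, and the toy graph are illustrative assumptions, not the paper's RawGAT-ST implementation.

```python
import numpy as np

def gat_layer(H, A, W, a, slope=0.2):
    """One generic graph attention layer (illustrative sketch).

    H: (N, F) node features; A: (N, N) adjacency with self-loops (1 = edge);
    W: (F, F2) linear projection; a: (2*F2,) attention vector.
    """
    Z = H @ W                                    # project node features
    N = Z.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # attention logit: LeakyReLU(a^T [z_i || z_j])
            s = np.concatenate([Z[i], Z[j]]) @ a
            e[i, j] = s if s > 0 else slope * s
    e = np.where(A > 0, e, -np.inf)              # mask non-edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over neighbours
    return att @ Z                               # weighted aggregation

# toy graph: 3 nodes (e.g. sub-band or temporal segment embeddings),
# fully connected with self-loops
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 4))
A = np.ones((3, 3))
W = rng.standard_normal((4, 2))
a = rng.standard_normal(4)
out = gat_layer(H, A, W, a)
print(out.shape)  # (3, 2)
```

In the paper's setting, one such sub-graph operates over spectral nodes and another over temporal nodes, and their outputs are fused at the model level rather than by combining separately trained systems.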