Adversarial attacks on audio recognition have recently attracted much attention. However, most existing studies rely on coarse-grained, instance-level audio features to generate adversarial noise, which leads to high generation time costs and weak universal attacking ability. Motivated by the observation that all speech audio consists of fundamental phonemes, this paper proposes a phonemic adversarial attack (PAT) paradigm, which attacks the fine-grained, phoneme-level audio features commonly shared across audio instances to generate phonemic adversarial noise, yielding more general attacking ability with fast generation speed. Specifically, to accelerate generation, a phoneme density balanced sampling strategy is introduced: by estimating phoneme density, it samples fewer audio instances that are nonetheless rich in phonemic features as the training data, which substantially alleviates the heavy dependency on a large training dataset. Moreover, to promote universal attacking ability, the phonemic noise is optimized asynchronously with a sliding window, which enhances phoneme diversity and thus captures the critical fundamental phonemic patterns well. Through extensive experiments, we comprehensively investigate the proposed PAT framework and demonstrate that it outperforms the SOTA baselines by large margins (i.e., at least an 11× speedup and a 78% improvement in attacking ability).
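To make the sampling idea concrete, the following is a minimal sketch of one way a phoneme density balanced sampler could look. It is not the paper's algorithm: the abstract does not define phoneme density, so the definition used here (distinct phonemes per second), the rarity-weighted greedy selection, and the names `phoneme_density` and `density_balanced_sample` are all assumptions made for illustration.

```python
from collections import Counter


def phoneme_density(phonemes, duration_s):
    """Distinct phonemes per second of audio (an assumed definition)."""
    return len(set(phonemes)) / duration_s


def density_balanced_sample(instances, k):
    """Greedily pick up to k instances, favoring dense clips that add rare phonemes.

    instances: list of (name, phoneme_list, duration_seconds) tuples.
    This greedy heuristic is hypothetical, not the method from the paper.
    """
    counts = Counter()  # how many chosen instances already contain each phoneme
    remaining = list(instances)
    chosen = []
    while remaining and len(chosen) < k:

        def score(item):
            _, phonemes, duration = item
            # Rarity-weighted gain: phonemes covered less often count for more.
            gain = sum(1.0 / (1 + counts[p]) for p in set(phonemes))
            return gain / duration

        best = max(remaining, key=score)
        remaining.remove(best)
        chosen.append(best[0])
        counts.update(set(best[1]))
    return chosen
```

With this scoring, a second clip that repeats already-covered phonemes ranks below a shorter or sparser clip that introduces new ones, which is the balancing effect the abstract alludes to: a small training set that still spans many phonemes.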