This paper introduces a novel adversarial algorithm for attacking state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo. Our approach extends the conventional distortion condition of the adversarial optimization formulation using the Cramér integral probability metric. Minimizing this metric, which measures the discrepancy between the distributions of original and adversarial samples, helps craft signals that lie very close to the subspace of legitimate speech recordings. This in turn yields adversarial signals that are more robust to over-the-air playback, without requiring either costly expectation over transformation operations or static room impulse response simulations. Our approach outperforms other targeted and non-targeted algorithms in terms of word error rate and sentence-level accuracy, with competitive quality of the crafted adversarial signals. Compared to seven other strong white-box and black-box adversarial attacks, our proposed approach is considerably more resilient to multiple consecutive over-the-air playbacks, corroborating its higher robustness in noisy environments.
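To make the distortion metric concrete, here is a minimal sketch of a one-dimensional empirical Cramér distance between two sets of samples, written in its energy-distance form. It is an illustration under assumptions, not the formulation used in the paper: the function names `mean_abs_diff` and `cramer_distance` are hypothetical, and treating raw scalar sample values as the compared distributions is an assumption made only for this example.

```python
def mean_abs_diff(a, b):
    """Average |p - q| over all pairs (p, q) drawn from a and b."""
    return sum(abs(p - q) for p in a for q in b) / (len(a) * len(b))


def cramer_distance(x, y):
    """Empirical 1-D Cramér distance between sample sets x and y.

    Uses the identity
        integral (F(t) - G(t))^2 dt = E|X - Y| - 0.5 * (E|X - X'| + E|Y - Y'|),
    where F and G are the empirical CDFs of x and y.
    Assumption: x and y are non-empty sequences of scalars.
    """
    x = [float(v) for v in x]
    y = [float(v) for v in y]
    cross = mean_abs_diff(x, y)    # E|X - Y|
    within_x = mean_abs_diff(x, x)  # E|X - X'|
    within_y = mean_abs_diff(y, y)  # E|Y - Y'|
    return cross - 0.5 * (within_x + within_y)


if __name__ == "__main__":
    original = [0.0, 2.0]       # stand-in for original sample values
    adversarial = [1.0]         # stand-in for adversarial sample values
    print(cramer_distance(original, adversarial))
    print(cramer_distance(original, original))
```

In an attack of the kind described above, a term like this would presumably be added to the adversarial objective as a penalty, so that minimizing it pulls the adversarial signal's distribution toward that of the original recording. How the paper weights or computes that term is not stated in the abstract.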