Whisper is a recent Automatic Speech Recognition (ASR) model displaying impressive robustness to both out-of-distribution inputs and random noise. In this work, we show that this robustness does not carry over to adversarial noise. We generate very small input perturbations, with Signal-to-Noise Ratios (SNR) of up to 45 dB, with which we can degrade Whisper's performance dramatically, or even transcribe a target sentence of our choice. We also show that by fooling the Whisper language detector we can very easily degrade the performance of multilingual models. These vulnerabilities of a widely popular open-source model have practical security implications and emphasize the need for adversarially robust ASR.
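To make the 45 dB figure concrete, the sketch below (an illustration, not the authors' code; the function names are ours) computes the SNR of a perturbation and rescales it to hit a target SNR. At 45 dB, the perturbation's power is roughly 10^4.5 times smaller than the signal's, i.e. nearly inaudible.

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-Noise Ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

def scale_to_snr(signal: np.ndarray, noise: np.ndarray, target_db: float) -> np.ndarray:
    """Rescale `noise` so that snr_db(signal, scaled_noise) == target_db."""
    current = snr_db(signal, noise)
    # Each dB of SNR corresponds to a factor of 10^(1/20) in amplitude.
    return noise * 10.0 ** ((current - target_db) / 20.0)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)        # 1 s of placeholder 16 kHz audio
perturbation = rng.standard_normal(16000)
scaled = scale_to_snr(audio, perturbation, 45.0)
print(round(snr_db(audio, scaled), 1))    # → 45.0
```

In an adversarial attack, `perturbation` would be optimized (e.g. by gradient descent on the model's transcription loss) subject to an SNR constraint of this kind, rather than drawn at random.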