Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework that generalizes and extends adversarial attacks so as to produce a desired probability distribution over the classes when the attack is applied to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats to deep learning models that cannot be mounted with conventional attack paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach on the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domains, respectively. Our results demonstrate that we can closely approximate any probability distribution over the classes while maintaining a high fooling rate, and can even prevent the attacks from being detected by label-shift detection methods.
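To make the stated objective concrete, a minimal formalization can be sketched as follows (the notation and the divergence-based formulation below are our own illustrative assumptions, not necessarily the paper's exact objective). Given a classifier $f$, an attack $A$, natural inputs $x_1, \dots, x_N$, a target class distribution $p^\ast$, and a perturbation budget $\epsilon$, the adversary seeks perturbations that drive the empirical distribution of predicted labels toward $p^\ast$:
\[
\hat{p}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ f\big(A(x_i)\big) = k \right],
\qquad
\min_{A} \; D\!\left( \hat{p} \,\middle\|\, p^\ast \right)
\quad \text{s.t.} \quad \lVert A(x_i) - x_i \rVert \le \epsilon \;\; \forall i,
\]
where $D$ is some divergence between distributions (e.g., KL) and the norm constraint stands in for whatever imperceptibility measure the underlying attack uses. Setting $p^\ast$ to place no mass on the true labels recovers a conventional untargeted attack, whereas matching $p^\ast$ to the natural label distribution is what would evade label-shift detection.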