We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a ``democratization of AI'' agenda, where network architectures and trained parameter sets are shared publicly. We develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs to have estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one-neuron attack is possible - a modification to the weights and bias associated with a single neuron - revealing a vulnerability arising from over-parameterization. We illustrate these concepts using state-of-the-art architectures on two standard image data sets. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks.
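To make the one-neuron attack concrete, the following NumPy sketch illustrates the idea on a toy ReLU network; the layer sizes, the margin delta, the gain kappa and the attacked neuron index are illustrative assumptions, not the construction analysed in the paper. A single hidden neuron is rewired so that it fires on a chosen trigger latent vector and, by concentration of measure, stays silent on inputs nearly orthogonal to it, with its activation routed into the attacker's chosen logit.
\begin{verbatim}
# Minimal one-neuron stealth-attack sketch on a toy network (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy "trained" network: latent vector -> hidden layer (ReLU) -> class logits.
d_latent, d_hidden, n_classes = 100, 200, 10
W1 = rng.normal(scale=0.3, size=(d_hidden, d_latent))
b1 = rng.normal(scale=0.1, size=d_hidden)
W2 = rng.normal(scale=0.3, size=(n_classes, d_hidden))
b2 = rng.normal(scale=0.1, size=n_classes)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # hidden activations
    return W2 @ h + b2                 # class logits

# Attacker's choices: a trigger latent vector and the class it should evoke.
x_trigger = rng.normal(size=d_latent)
target_class = 3

# One-neuron attack: rewire a single hidden neuron so that it fires on the
# trigger but (with high probability) stays silent on other inputs, then
# route its activation into the target logit.
neuron = 7        # the single neuron whose weights and bias are modified
delta = 2.0       # margin by which the neuron fires on the trigger
kappa = 100.0     # gain from the attacked neuron into the target logit

u = x_trigger / np.linalg.norm(x_trigger)   # direction aligned with the trigger
W1[neuron, :] = u
b1[neuron] = delta - u @ x_trigger          # pre-activation equals delta at the
                                            # trigger and is strongly negative for
                                            # inputs nearly orthogonal to it
W2[:, neuron] = 0.0
W2[target_class, neuron] = kappa            # a firing neuron boosts the target class

# Checks: the attacked neuron rarely fires on random probe inputs (so, for an
# over-parameterized network in which this neuron's original contribution is
# negligible, behaviour on unseen validation data is essentially preserved),
# while the trigger is now driven to the attacker's chosen class.
probes = rng.normal(size=(1000, d_latent))
fire_rate = np.mean(probes @ W1[neuron] + b1[neuron] > 0)
print(f"attacked neuron fires on {100 * fire_rate:.1f}% of probe inputs")
print("trigger is classified as class", int(np.argmax(forward(x_trigger))))
\end{verbatim}
In this toy, transparency is only approximate: the attacked neuron's original contribution is simply discarded, whereas the analysis in the paper quantifies when such a single-neuron modification leaves performance unchanged on a validation set unknown to the attacker.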