We propose a novel technique that generates natural-looking adversarial examples by bounding the variations induced in internal activation values at some deep layer(s), through a distribution quantile bound and a polynomial barrier loss function. By bounding model internals instead of individual pixels, our attack admits perturbations closely coupled with the existing features of the original input, allowing the generated examples to be natural-looking while having diverse and often substantial pixel distances from the original input. Enforcing per-neuron distribution quantile bounds addresses the non-uniformity of internal activation values across neurons. Our evaluation on ImageNet and five different model architectures demonstrates that our attack is quite effective. Compared to state-of-the-art pixel-space, semantic, and feature-space attacks, our attack achieves the same level of attack success/confidence while producing much more natural-looking adversarial perturbations. These perturbations piggyback on existing local features and do not have any fixed pixel bounds.
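To make the mechanism concrete, below is a minimal PyTorch-style sketch, not the authors' released code, of how per-neuron quantile budgets and a polynomial barrier loss could be combined in an attack step. The helper names (`head`, `tail`, `per_neuron_budget`, `polynomial_barrier`, `lam`) and the specific quantile/degree choices are illustrative assumptions, not taken from the paper. The barrier stays near zero while the induced activation change remains within each neuron's budget and grows polynomially once it exceeds it, so the optimizer is softly steered back toward the feasible region rather than hard-clipped.

```python
# Illustrative sketch only (hypothetical names; not the paper's implementation):
# one step of a targeted attack that bounds per-neuron activation changes at an
# internal layer using quantile-derived budgets and a polynomial barrier penalty.
import torch
import torch.nn.functional as F

def per_neuron_budget(benign_acts, q=0.1):
    """Per-neuron variation budget from each neuron's activation distribution on
    benign data (shape: num_samples x num_neurons); here the q/(1-q) quantile spread."""
    lo = torch.quantile(benign_acts, q, dim=0)
    hi = torch.quantile(benign_acts, 1.0 - q, dim=0)
    return hi - lo

def polynomial_barrier(delta, budget, degree=4):
    """Roughly zero while |delta| stays inside the per-neuron budget,
    grows polynomially once the activation change exceeds it."""
    excess = F.relu(delta.abs() - budget)
    return (excess ** degree).sum()

def attack_step(head, tail, x, x_adv, target, budget, lam=1.0, lr=0.01):
    """One gradient step of a targeted attack. `head` maps input -> internal layer,
    `tail` maps internal layer -> logits (a hypothetical split of the model)."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    act_orig = head(x).detach()   # reference activations of the clean input
    act_adv = head(x_adv)         # activations of the current adversarial input
    logits = tail(act_adv)
    loss = F.cross_entropy(logits, target) \
        + lam * polynomial_barrier((act_adv - act_orig).flatten(1), budget)
    loss.backward()
    with torch.no_grad():
        x_adv = (x_adv - lr * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```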