Convolutional neural networks (CNNs) are fragile to small perturbations of their input images. These networks are thus prone to malicious attacks that perturb the inputs to force a misclassification. Such slightly manipulated images designed to deceive the classifier are known as adversarial images. In this work, we investigate statistical differences between natural and adversarial images. More precisely, we show that, after a proper image transformation and for a class of adversarial attacks, the distribution of the leading digits of the pixels in adversarial images deviates from Benford's law. The stronger the attack, the further the resulting distribution departs from Benford's law. Our analysis provides a detailed investigation of this new approach, which can serve as a basis for alternative adversarial example detection methods that neither require modifying the original CNN classifier nor operate on the raw high-dimensional pixels as features to defend against attacks.
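To make the idea concrete, the sketch below scores an image by how far the first-digit histogram of a transformed version of it lies from Benford's law. The abstract does not specify the "proper image transformation" or the distance measure; the gradient-magnitude transform and total-variation distance used here are illustrative assumptions only, not the paper's method.

```python
import numpy as np

# Benford's law: expected frequency of leading digit d = 1..9
BENFORD = np.log10(1.0 + 1.0 / np.arange(1, 10))

def leading_digits(values):
    """Return the leading (most significant) decimal digit of each positive value."""
    values = values[values > 0]
    exponents = np.floor(np.log10(values))
    return (values / 10.0 ** exponents).astype(int)  # digits in 1..9

def benford_deviation(image):
    """Distance between the first-digit histogram of a transformed image and
    Benford's law. The gradient-magnitude transform is an illustrative choice;
    the paper's transformation may differ."""
    gy, gx = np.gradient(image.astype(float))
    magnitudes = np.hypot(gx, gy).ravel()
    digits = leading_digits(magnitudes)
    observed = np.bincount(digits, minlength=10)[1:10].astype(float)
    if observed.sum() == 0:
        return 0.0  # degenerate image with no positive gradients
    observed /= observed.sum()
    # total-variation distance; larger values suggest a stronger deviation
    return 0.5 * np.abs(observed - BENFORD).sum()

# Usage: a threshold on this score could flag suspected adversarial inputs;
# the threshold itself would have to be calibrated on natural images.
if __name__ == "__main__":
    img = np.random.rand(224, 224) * 255.0  # stand-in for a real grayscale image
    print("Benford deviation:", benford_deviation(img))
```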