As deep neural network (DNN) models attract increasing attention, attacks against them are emerging as well. For example, an attacker may carefully craft images, known as adversarial examples, to mislead a DNN model into producing incorrect classification results. In response, many methods have been proposed to detect and mitigate adversarial examples, usually targeting specific attacks. In this paper, we propose a novel digital-watermark-based method for generating image adversarial examples that fool DNN models. Specifically, part of the main features of a watermark image is embedded almost invisibly into a host image in order to degrade the recognition capability of the DNN model. We devise an efficient mechanism for selecting host images and watermark images, and we use an improved discrete wavelet transform (DWT) based Patchwork watermarking algorithm, with a set of valid hyperparameters, to embed watermarks from the watermark image dataset into the original images. Experimental results show that the attack success rate on common DNN models averages 95.47% on the CIFAR-10 dataset and reaches 98.71% at best. Our scheme also generates adversarial examples efficiently, taking an average of 1.17 seconds to complete the attack on each CIFAR-10 image. In addition, a baseline experiment that uses watermark images generated from Gaussian noise as the watermark image dataset also demonstrates the effectiveness of our scheme. We further propose a modified discrete cosine transform (DCT) based Patchwork watermarking algorithm. To ensure repeatability and reproducibility, the source code is available on GitHub.
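The general idea of embedding watermark features in the wavelet domain can be illustrated with a minimal sketch. This is not the improved Patchwork algorithm described above: it is a simplified additive embedding into the low-frequency (LL) sub-band of a one-level Haar DWT, written with NumPy only. The function names `haar_dwt2`, `haar_idwt2`, and `embed_watermark`, and the strength parameter `alpha`, are hypothetical and chosen for illustration. The sketch assumes single-channel images with even side lengths and pixel values in [0, 1]; for an RGB image such as a CIFAR-10 sample, it could be applied to each channel separately.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT of an array with even height and width."""
    # Filter along the vertical axis (pairs of rows).
    low = (x[0::2, :] + x[1::2, :]) / SQRT2
    high = (x[0::2, :] - x[1::2, :]) / SQRT2
    # Filter along the horizontal axis (pairs of columns).
    ll = (low[:, 0::2] + low[:, 1::2]) / SQRT2
    lh = (low[:, 0::2] - low[:, 1::2]) / SQRT2
    hl = (high[:, 0::2] + high[:, 1::2]) / SQRT2
    hh = (high[:, 0::2] - high[:, 1::2]) / SQRT2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the image from its four sub-bands."""
    h, w = ll.shape
    low = np.empty((h, 2 * w))
    high = np.empty((h, 2 * w))
    # Undo the horizontal filtering.
    low[:, 0::2] = (ll + lh) / SQRT2
    low[:, 1::2] = (ll - lh) / SQRT2
    high[:, 0::2] = (hl + hh) / SQRT2
    high[:, 1::2] = (hl - hh) / SQRT2
    # Undo the vertical filtering.
    x = np.empty((2 * h, 2 * w))
    x[0::2, :] = (low + high) / SQRT2
    x[1::2, :] = (low - high) / SQRT2
    return x


def embed_watermark(host, watermark, alpha=0.05):
    """Add a scaled copy of the watermark's LL sub-band to the host's LL sub-band.

    Illustrative assumption only; the paper's method is a Patchwork variant.
    """
    ll_h, lh_h, hl_h, hh_h = haar_dwt2(host)
    ll_w, _, _, _ = haar_dwt2(watermark)
    mixed = haar_idwt2(ll_h + alpha * ll_w, lh_h, hl_h, hh_h)
    # Keep the result a valid image.
    return np.clip(mixed, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    host = rng.random((32, 32))       # stand-in for one CIFAR-10 channel
    watermark = rng.random((32, 32))  # stand-in for a watermark image
    candidate = embed_watermark(host, watermark, alpha=0.05)
    print("max pixel change:", np.abs(candidate - host).max())
```

A small `alpha` keeps the change hard to see, while a larger value transfers more of the watermark's coarse structure. Whether a given output actually fools a classifier has to be checked against the target model, which this sketch does not do.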