The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks apply unstructured, pixel-wise perturbations to fool the classifier. An alternative approach is to introduce perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack that designs structured perturbations with semantic meaning. Our technique manipulates the semantic attributes of images via disentangled latent codes. The intuition behind our technique is that images in similar domains share some common but theme-independent semantic attributes, e.g., the thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbations by manipulating a single latent code or a combination of them, and propose two unsupervised semantic manipulation approaches, vector-based disentangled representation and feature-map-based disentangled representation, which differ in the complexity of the latent codes and the smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks against black-box classifiers. We further demonstrate the existence of a universal, image-agnostic semantic adversarial example.
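The following is a minimal sketch of the vector-based semantic attack described above, assuming a pretrained disentangled encoder/decoder pair (e.g., a beta-VAE-style model) and query-only access to a black-box classifier. The names `encoder`, `decoder`, `classifier`, and `attr_dim` are placeholders for components not defined in this abstract; the actual method may differ in how the latent codes are selected and perturbed.

```python
import torch

def semantic_attack(x, label, encoder, decoder, classifier,
                    attr_dim, steps=20, step_size=0.1):
    """Sweep one disentangled latent dimension (a semantic attribute) until the
    black-box classifier's prediction changes, then return the adversarial image."""
    with torch.no_grad():
        z = encoder(x)                      # disentangled latent code of the input image
        for direction in (+1.0, -1.0):      # try increasing and decreasing the attribute
            z_adv = z.clone()
            for _ in range(steps):
                z_adv[:, attr_dim] += direction * step_size  # manipulate one semantic code
                x_adv = decoder(z_adv)                       # reconstruct the perturbed image
                pred = classifier(x_adv).argmax(dim=1)       # single black-box query
                if (pred != label).all():                    # prediction flipped: attack succeeded
                    return x_adv
    return None  # no perturbation of this attribute fooled the classifier
```

In this sketch, perturbing a whole feature map instead of a single vector dimension would correspond to the feature-map-based variant mentioned above.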