Machine learning models are known to be vulnerable to adversarial attacks, most notably the misclassification of adversarial examples. In this paper, we investigate an attack-agnostic defense against adversarial attacks on high-resolution images by detecting suspicious inputs. The intuition behind our approach is that the essential characteristics of a normal image remain consistent under non-essential style transformations, e.g., slightly changing the facial expression of a human portrait. In contrast, adversarial examples are generally sensitive to such transformations. To detect adversarial instances, we propose an in\underline{V}ertible \underline{A}utoencoder based on the \underline{S}tyleGAN2 generator via \underline{A}dversarial training (VASA) to invert images into disentangled latent codes that reveal hierarchical styles. We then build a set of edited copies with non-essential style transformations by performing latent shifting and reconstruction, based on the correspondences between latent codes and style transformations. The classification consistency across these edited copies is used to distinguish adversarial instances.
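To make the detection pipeline concrete, the following is a minimal sketch of the consistency check described above, not the paper's actual implementation. The `encoder`, `generator`, and `classifier` callables, the `style_directions` list, and the `shift_scale` and `threshold` parameters are hypothetical placeholders standing in for the VASA components.

```python
import torch

def is_adversarial(image, encoder, generator, classifier,
                   style_directions, shift_scale=0.5, threshold=0.8):
    """Flag an input as adversarial when its predicted label is unstable
    under non-essential style edits applied in latent space.

    image            : input tensor of shape (1, C, H, W)
    encoder          : maps an image to a layer-wise latent code w (placeholder)
    generator        : maps a latent code w back to an image, StyleGAN2-like (placeholder)
    classifier       : the protected classifier returning logits (placeholder)
    style_directions : list of (layer_index, direction_vector) pairs assumed to
                       correspond to non-essential style transformations
    """
    with torch.no_grad():
        # Label of the original (possibly adversarial) input.
        base_label = classifier(image).argmax(dim=1)

        # Invert the image into its disentangled latent code.
        w = encoder(image)

        # Build edited copies by shifting individual style layers and
        # reconstructing, then classify each edited copy.
        agreements = []
        for layer_idx, direction in style_directions:
            w_edit = w.clone()
            w_edit[:, layer_idx] += shift_scale * direction
            edited = generator(w_edit)
            edited_label = classifier(edited).argmax(dim=1)
            agreements.append((edited_label == base_label).float())

        # A benign image keeps its label under most non-essential edits;
        # an adversarial example tends to lose it.
        consistency = torch.stack(agreements).mean()
        return bool(consistency < threshold)
```

In this sketch, the detection decision reduces to a single agreement ratio over the edited copies; the threshold is an assumed hyperparameter that would in practice be tuned on benign validation images.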