Deep neural networks are vulnerable to adversarial attacks. We consider adversarial defense in the zero-shot image classification setting, which has rarely been explored because both adversarial defense and zero-shot learning are challenging. We propose LAAT, a novel Language-driven, Anchor-based Adversarial Training strategy, to improve adversarial robustness in the zero-shot setting. LAAT uses a text encoder to obtain a fixed anchor (a normalized feature embedding) for each category, then uses these anchors to perform adversarial training. The text encoder has the property that semantically similar categories are mapped to neighboring anchors in the feature space. By leveraging this property, LAAT can make the image model adversarially robust on novel categories without any extra examples. Experimental results show that our method achieves impressive zero-shot adversarial performance, even surpassing the previous state-of-the-art adversarially robust one-shot methods in most attack settings. When models are trained with LAAT on large datasets like ImageNet-1K, they can exhibit substantial zero-shot adversarial robustness across several downstream datasets.
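The role of the fixed anchors can be illustrated with a minimal sketch. It is not the authors' implementation. The three-dimensional vectors below are hypothetical stand-ins for text-encoder outputs, and the names `normalize`, `cosine_logits`, and `predict` are introduced here for illustration only. The sketch covers the anchor-based classification step: an image feature is assigned to the category whose normalized anchor has the highest cosine similarity, so a novel category can be handled by adding its anchor, without any image examples. The adversarial training loop itself is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(feature, anchors):
    """Cosine similarity between one image feature and every category anchor."""
    f = normalize(feature)
    return {
        name: sum(a * b for a, b in zip(f, anchor))
        for name, anchor in anchors.items()
    }


def predict(feature, anchors):
    """Return the category whose anchor is most similar to the feature."""
    logits = cosine_logits(feature, anchors)
    return max(logits, key=logits.get)


# Hypothetical text-encoder outputs for the categories seen during training.
# A real text encoder would produce much higher-dimensional embeddings.
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "truck": [0.0, 0.2, 0.9],
}

# Fixed anchors: the normalized embeddings, one per category.
anchors = {name: normalize(vec) for name, vec in text_embeddings.items()}

# A novel category is added by encoding its name only (again a made-up
# vector, placed near "truck" to mimic semantic similarity).
anchors["bus"] = normalize([0.1, 0.1, 0.95])

# Hypothetical image feature produced by the image model.
image_feature = [0.1, 0.1, 1.0]
prediction = predict(image_feature, anchors)
print(prediction)
```

In this toy setup the "bus" anchor lies close to the "truck" anchor, which mirrors the property the abstract relies on: semantically similar categories receive neighboring anchors, so features learned for seen categories can transfer to unseen ones.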