Deep neural networks are known to be susceptible to adversarial attacks. In this work, we focus on improving adversarial robustness in the challenging zero-shot image classification setting. To address this issue, we propose LAAT, a novel Language-driven, Anchor-based Adversarial Training strategy. LAAT utilizes a text encoder to generate fixed anchors (normalized feature embeddings) for each category and then uses these anchors for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT can enhance the adversarial robustness of the image model on novel categories without additional examples. We identify the large cosine similarity problem of recent text encoders and design several effective techniques to address it. The experimental results demonstrate that LAAT significantly improves zero-shot adversarial performance, outperforming previous state-of-the-art adversarially robust one-shot methods. Moreover, our method produces substantial zero-shot adversarial robustness when models are trained on large datasets such as ImageNet-1K and applied to several downstream datasets.
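The mechanism described above can be illustrated with a toy sketch: fixed per-class anchors stand in for text-encoder embeddings, classification uses cosine-similarity logits against those anchors, and anchor centering is shown as one plausible way to reduce pairwise cosine similarity. This is a minimal numpy illustration under stated assumptions, not the paper's implementation: the random "anchors", the centering step, the temperature value, and all function names here are hypothetical, and the adversarial perturbation step of LAAT is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical stand-in for text-encoder outputs: one embedding per class.
# Real LAAT obtains anchors from a pretrained text encoder; here we mimic the
# "large cosine similarity" problem by injecting a shared bias direction.
d, k = 64, 5
shared = rng.normal(size=d)                        # component common to all classes
anchors = normalize(rng.normal(size=(k, d)) + 4.0 * shared)

sim = anchors @ anchors.T                          # pairwise cosine similarities
off_diag = sim[~np.eye(k, dtype=bool)]
print("mean off-diagonal cosine (raw):", off_diag.mean())

# One plausible mitigation (an assumption, not necessarily the paper's exact
# technique): subtract the mean anchor and renormalize to spread anchors apart.
centered = normalize(anchors - anchors.mean(axis=0, keepdims=True))
off_c = (centered @ centered.T)[~np.eye(k, dtype=bool)]
print("mean off-diagonal cosine (centered):", off_c.mean())

def anchor_loss(img_feat, anchors, label, temp=0.07):
    """Cross-entropy over cosine logits between an image feature and the fixed
    anchors. In adversarial training, the input image would be perturbed to
    maximize this loss before the model update (perturbation step omitted)."""
    logits = (normalize(img_feat) @ anchors.T) / temp
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

img_feat = rng.normal(size=d)                      # stand-in image embedding
print("anchor loss:", anchor_loss(img_feat, centered, label=2))
```

Because the anchors are fixed and produced from class names alone, the same classification rule extends to unseen categories at test time: new anchors are generated from their names, with no image examples required.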
Title: Language-Driven Anchors for Zero-Shot Adversarial Robustness