Masked Image Modeling (MIM) has been a prevailing framework for self-supervised visual representation learning. Within the pretraining-finetuning paradigm, the MIM framework trains an encoder by reconstructing masked image patches with the help of a decoder, which is discarded when the encoder is used for finetuning. Despite their state-of-the-art performance on clean images, MIM models are vulnerable to adversarial attacks, limiting their real-world applications, and few studies have focused on this issue. In this paper, we discover that noisy image modeling (NIM), a variant of MIM that uses denoising as the pretext task, provides not only good pretrained visual features but also an effective adversarial defense for downstream models. To achieve a better accuracy-robustness trade-off, we further propose to sample the hyperparameter that controls the reconstruction difficulty from a random distribution instead of setting it globally, and to fine-tune downstream networks on denoised images. Experimental results demonstrate that our pretrained denoising autoencoders are effective against various white-box, gray-box, and black-box attacks without being trained on adversarial images, while not harming the clean accuracy of fine-tuned models. Source code and models will be made available.
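The per-sample noise-level idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `corrupt`, the uniform distribution, and the `sigma_range` bounds are assumptions chosen for clarity; the paper samples the difficulty-controlling hyperparameter from a random distribution rather than fixing it globally.

```python
import numpy as np

def corrupt(images, rng, sigma_range=(0.0, 1.0)):
    """Corrupt a batch of images for a denoising pretext task.

    Instead of one global noise level, each image in the batch draws its
    own sigma from a distribution (uniform here, as an illustrative
    choice), so reconstruction difficulty varies across samples.
    The training target remains the clean `images`.
    """
    n = images.shape[0]
    # One noise level per sample, broadcast over channel/height/width.
    sigmas = rng.uniform(sigma_range[0], sigma_range[1], size=(n, 1, 1, 1))
    noisy = images + sigmas * rng.standard_normal(images.shape)
    return noisy, sigmas

# Usage: the denoising autoencoder would be trained to map `noisy`
# back to `images`, e.g. with an MSE reconstruction loss.
rng = np.random.default_rng(0)
clean = np.zeros((4, 3, 8, 8))
noisy, sigmas = corrupt(clean, rng)
```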