Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- and modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL approaches based on learning from partial image inputs generated via masking or cropping -- either by learning to predict the missing pixels, or learning representational invariances to such augmentations -- we propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs. PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated via a novel attention-conditioned masking strategy, to identify reliable candidates for self-training. Our simple approach leads to consistent performance gains over competing methods that use ViTs and self-supervised initializations on standard object recognition benchmarks. Code available at https://github.com/virajprabhu/PACMAC
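To make the selection step concrete, below is a minimal sketch of attention-conditioned masking and consistency-based candidate selection as described above. It assumes a hypothetical ViT classifier `model(images, patch_mask=...)` that accepts a boolean patch-keep mask, plus precomputed CLS-to-patch attention scores; all helper names are illustrative, not the released implementation (see the repository above for the actual code).

```python
# Illustrative sketch only: `model(images, patch_mask=...)` and all helper
# names are hypothetical, not the released PACMAC implementation.
import torch


def attention_conditioned_masks(attn_scores: torch.Tensor, num_views: int = 3):
    """Split patches into disjoint, attention-balanced subsets.

    attn_scores: (B, N) CLS-to-patch attention per image.
    Returns num_views boolean masks of shape (B, N); True = patch kept,
    so each view sees a different attention-ranked slice of the image.
    """
    B, N = attn_scores.shape
    order = attn_scores.argsort(dim=1, descending=True)  # most-attended first
    masks = []
    for v in range(num_views):
        idx = order[:, v::num_views]  # interleave attention ranks across views
        mask = torch.zeros(B, N, dtype=torch.bool, device=attn_scores.device)
        mask.scatter_(1, idx, True)
        masks.append(mask)
    return masks


@torch.no_grad()
def select_for_self_training(model, images, attn_scores):
    """Keep target images whose class prediction agrees across all views."""
    preds = torch.stack([
        model(images, patch_mask=m).argmax(dim=1)
        for m in attention_conditioned_masks(attn_scores)
    ])  # (num_views, B)
    consistent = (preds == preds[0]).all(dim=0)  # unanimous across views
    return images[consistent], preds[0][consistent]  # reliable pseudo-labels
```

Images passing this unanimity check are treated as reliable candidates and paired with their predicted labels for the self-training stage; inconsistent images are simply skipped.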