Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is Masked Image Modeling (MIM), which aligns with the pretraining of language transformers by predicting masked patches as the pretext task. A criterion in unsupervised pretraining is that the pretext task must be sufficiently hard to prevent the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach: it distorts the local visual structures by perturbing the position encodings so that the learned transformer cannot simply use locally correlated patches to predict the missing ones. We hypothesize that this forces the transformer encoder to learn more discriminative features in a global context, with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in either the embedding mode or the coordinate mode. We also present a new MAE+ baseline that, together with AdPE, brings the performance of MIM pretraining to a new level. The experiments demonstrate that our approach improves the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ when pretraining ViT-B and ViT-L for 1600 epochs on ImageNet-1K. For the transfer learning tasks, it outperforms MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO. These results are obtained with AdPE as a pure MIM approach that uses no extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.
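To make the core idea concrete, below is a minimal PyTorch sketch of adversarial positional perturbation in the embedding mode: an inner gradient-ascent step perturbs the absolute positional embeddings to maximize the MIM reconstruction loss, and the outer step then updates the model under the perturbed positions. The toy model, the FGSM-style bounded step, and all names (`ToyMIM`, `adpe_style_step`, `eps`) are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Sketch: adversarial perturbation of absolute positional embeddings for MIM.
# Assumptions: a toy encode-all/predict-masked MIM model and an FGSM-style
# inner step; the actual AdPE formulation may differ.
import torch
import torch.nn as nn

class ToyMIM(nn.Module):
    """Toy masked-image-modeling model: patch embed + transformer + pixel head."""
    def __init__(self, num_patches=196, dim=128, patch_dim=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # absolute PE
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch_dim)  # reconstruct raw patch pixels

    def forward(self, patches, pos_delta=None):
        pos = self.pos_embed if pos_delta is None else self.pos_embed + pos_delta
        x = self.patch_embed(patches) + pos
        return self.head(self.encoder(x))

def adpe_style_step(model, patches, mask, optimizer, eps=0.1):
    """One training step with an adversarial perturbation on positions.

    mask: bool tensor (B, N); True marks masked patches whose pixels are
    predicted. eps bounds the perturbation (an assumed constraint)."""
    loss_fn = nn.MSELoss()
    # Inner maximization: one gradient-ascent step on the positional delta,
    # distorting positions so local correlations alone cannot solve the task.
    delta = torch.zeros_like(model.pos_embed, requires_grad=True)
    recon = model(patches, pos_delta=delta)
    inner_loss = loss_fn(recon[mask], patches[mask])
    grad, = torch.autograd.grad(inner_loss, delta)
    delta = (eps * grad.sign()).detach()  # FGSM-style bounded step (assumption)
    # Outer minimization: update the model under the perturbed positions.
    optimizer.zero_grad()
    recon = model(patches, pos_delta=delta)
    loss = loss_fn(recon[mask], patches[mask])
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = ToyMIM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    patches = torch.randn(2, 196, 768)   # B x N x (flattened patch pixels)
    mask = torch.rand(2, 196) < 0.75     # MAE-style 75% masking
    print(adpe_style_step(model, patches, mask, opt))
```

The bounded perturbation matters: an unconstrained inner step would destroy all positional information, whereas a small adversarial distortion of the positions keeps the task solvable while discouraging reliance on locally correlated patches.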