Positional encoding is important for vision transformers (ViTs) to capture the spatial structure of the input image, and its general effectiveness has been demonstrated in ViT. In this work we propose to train ViT to recognize the positional labels of the patches of the input image; this apparently simple task actually yields a meaningful self-supervisory signal. Building on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can be easily plugged into various current ViT variants and can be used in two ways: (a) as an auxiliary training target for vanilla ViTs (e.g., ViT-B and Swin-B) for better performance, and (b) combined with a self-supervised ViT (e.g., MAE) to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that with the proposed self-supervised methods, ViT-B and Swin-B gain improvements of 1.20% and 0.74% top-1 accuracy on ImageNet, respectively, and improvements of 6.15% and 1.14% on Mini-ImageNet.
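As a minimal sketch of what the two kinds of positional labels could look like (the abstract does not give the exact construction, so the grid layout and class mapping below are assumptions, not the authors' code): for a ViT that splits an image into a `grid_size` × `grid_size` grid of patches, the absolute label of a patch can be its flattened grid index, and the relative label of an ordered patch pair can be its 2D offset mapped to a single class id.

```python
# Hypothetical sketch of absolute and relative positional labels for
# ViT patches; grid layout and class mapping are assumed, not from the paper.

def absolute_position_labels(grid_size):
    """Each patch's label is its flattened index in the grid: 0 .. N*N - 1."""
    return [row * grid_size + col
            for row in range(grid_size)
            for col in range(grid_size)]

def relative_position_labels(grid_size):
    """Label each ordered patch pair (i, j) by its 2D offset
    (row_j - row_i, col_j - col_i), shifted to be non-negative and
    flattened. Offsets span [-(N-1), N-1] per axis: (2N-1)**2 classes."""
    span = 2 * grid_size - 1
    labels = {}
    for i in range(grid_size * grid_size):
        ri, ci = divmod(i, grid_size)
        for j in range(grid_size * grid_size):
            rj, cj = divmod(j, grid_size)
            dr = rj - ri + grid_size - 1  # shift into [0, 2N-2]
            dc = cj - ci + grid_size - 1
            labels[(i, j)] = dr * span + dc
    return labels
```

A prediction head on top of the patch embeddings would then be trained with cross-entropy against these labels, either as an auxiliary loss alongside supervised training or as part of a self-supervised objective.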