Recently, masked image modeling (MIM) has offered a new approach to self-supervised pre-training of vision transformers. A key to efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), although hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties for formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", yielding structurally simple hierarchical vision transformers in which mask units can be serialized as in plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attention before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in fully-supervised, self-supervised, and transfer learning. In particular, when running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to the downstream tasks of detection and segmentation. Code will be made publicly available.
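The idea of discarding masked units and serializing the visible ones can be illustrated with a small sketch. This is not the authors' implementation. It only assumes a 224×224 input, 4×4 early-stage patches, a 16×16 mask unit (taken here as the main-stage token size), and a 75% mask ratio, none of which are stated in the abstract. All function names are hypothetical.

```python
import random


def mask_unit_grid(image_size=224, unit_size=16):
    """Return the number of mask units along one side of the image."""
    assert image_size % unit_size == 0, "image must tile evenly into units"
    return image_size // unit_size


def sample_visible_units(num_units, mask_ratio=0.75, seed=0):
    """Randomly choose which mask units stay visible (MAE-style masking).

    The mask ratio and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_keep = int(num_units * (1 - mask_ratio))
    order = list(range(num_units))
    rng.shuffle(order)
    return sorted(order[:num_keep])


def tokens_in_unit(unit_index, grid, unit_size=16, patch_size=4):
    """List (row, col) coordinates of the early-stage patches inside one unit."""
    per_side = unit_size // patch_size
    unit_row, unit_col = divmod(unit_index, grid)
    return [
        (unit_row * per_side + r, unit_col * per_side + c)
        for r in range(per_side)
        for c in range(per_side)
    ]


def serialize_visible(visible_units, grid, unit_size=16, patch_size=4):
    """Gather only the visible units into a flat sequence.

    Each entry holds the patches of one unit. Because no operation mixes
    patches from different units before the main stage, masked units can be
    dropped without affecting the visible ones.
    """
    return [tokens_in_unit(u, grid, unit_size, patch_size) for u in visible_units]


grid = mask_unit_grid()                      # units per side
visible = sample_visible_units(grid * grid)  # indices of kept units
sequence = serialize_visible(visible, grid)  # one entry per visible unit
print(len(visible), "of", grid * grid, "units kept")
```

Under these assumptions the encoder would see only a quarter of the units, each carrying its own block of early-stage patches, which is what allows a hierarchical encoder to skip masked content the way a plain ViT does.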