Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, hierarchical ViTs require carefully designed customized algorithms, e.g., GreenMIM, instead of the vanilla and simple MAE used for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, pre-training them from scratch incurs a massive computational cost, adding both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea: disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of the linear embedding layer from 16 to 4 and add convolutional (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 of the input resolution sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on the ImageNet, MS COCO, and Cityscapes/ADE20K benchmarks, respectively. We hope this preliminary study draws more attention from the community to developing effective (hierarchical) ViTs that avoid the pre-training cost by leveraging off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
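As a rough illustration of the modification described above, the following PyTorch sketch turns a plain-ViT-style backbone into a hierarchical one: a stride-4 patch embedding replaces the usual stride-16 one, and pooling layers between groups of transformer blocks shrink the feature map from 1/4 to 1/32 of the input resolution. This is a minimal sketch, not the authors' released code; the class name, block counts, embedding dimension, and the use of torch.nn.TransformerEncoderLayer are illustrative assumptions.

```python
# Minimal sketch (assumed layout, not the official implementation) of converting a
# plain ViT into a hierarchical backbone: stride-4 patch embedding plus pooling
# layers between transformer stages, yielding 1/4, 1/8, 1/16, and 1/32 features.
import torch
import torch.nn as nn


class HierarchicalViTSketch(nn.Module):
    def __init__(self, in_chans=3, dim=96, depths=(2, 2, 6, 2), num_heads=4):
        super().__init__()
        # Stride-4 linear (patch) embedding instead of the plain ViT's stride 16.
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=4, stride=4)
        self.stages = nn.ModuleList()
        self.pools = nn.ModuleList()
        for i, depth in enumerate(depths):
            blocks = nn.Sequential(*[
                nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
                for _ in range(depth)
            ])
            self.stages.append(blocks)
            if i < len(depths) - 1:
                # Convolutional pooling between stages halves the resolution
                # (a simple average pooling layer would also work, per the abstract).
                self.pools.append(nn.Conv2d(dim, dim, kernel_size=2, stride=2))

    def forward(self, x):
        x = self.patch_embed(x)                    # B, C, H/4, W/4
        feats = []
        for i, blocks in enumerate(self.stages):
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)  # B, HW, C
            tokens = blocks(tokens)
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
            feats.append(x)                        # 1/4, 1/8, 1/16, 1/32 features
            if i < len(self.pools):
                x = self.pools[i](x)
        return feats


if __name__ == "__main__":
    model = HierarchicalViTSketch()
    outs = model(torch.randn(1, 3, 224, 224))
    print([o.shape for o in outs])  # spatial sizes 56, 28, 14, 7 for a 224 input
```

Keeping the embedding dimension fixed across stages (rather than doubling it as in Swin-style designs) is what makes it plausible to reuse off-the-shelf plain-ViT weights for the transformer blocks, which is the point the abstract emphasizes.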