We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy: to mitigate the quadratic complexity of self-attention with respect to the number of patches, group attention partitions the visible patches within local windows of arbitrary sizes into groups of equal size, and then performs masked self-attention within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm that minimizes the overall computation cost of the attention on the grouped patches. Third, we convert the convolution layers to Sparse Convolution, which operates seamlessly on the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superior performance on the downstream COCO object detection benchmark. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
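To make the grouping concrete, below is a minimal PyTorch sketch of the masked group attention mechanics. This is our own illustration under simplifying assumptions, not the authors' implementation: the function name `group_window_attention` and its arguments are hypothetical, the naive packing may split a window across two groups, and the paper's Dynamic Programming step that picks the cost-minimizing partition is not reproduced here.

```python
import torch
import torch.nn.functional as F

def group_window_attention(tokens, window_ids, group_size):
    """Sketch of attention over grouped visible patches.

    tokens:     (N, C) embeddings of the visible patches only
                (masked patches have already been discarded).
    window_ids: (N,)   index of the local window each patch belongs to.
    group_size: number of patches per attention group (in the paper this
                is chosen by a Dynamic Programming algorithm).
    """
    N, C = tokens.shape
    # Sort so that patches of the same window are contiguous before packing.
    order = torch.argsort(window_ids)
    tokens, window_ids = tokens[order], window_ids[order]
    # Pad to a multiple of group_size and reshape into (num_groups, S, C).
    pad = (-N) % group_size
    tokens = F.pad(tokens, (0, 0, 0, pad))
    window_ids = F.pad(window_ids, (0, pad), value=-1)  # -1 marks padding
    g_tok = tokens.view(-1, group_size, C)
    g_wid = window_ids.view(-1, group_size)
    # Masked self-attention within each group: a patch may attend only to
    # patches from its own window (True = may attend).
    allow = g_wid.unsqueeze(2) == g_wid.unsqueeze(1)  # (num_groups, S, S)
    out = F.scaled_dot_product_attention(g_tok, g_tok, g_tok, attn_mask=allow)
    # Drop the padding and restore the original patch order.
    out = out.reshape(-1, C)[:N]
    inv = torch.empty_like(order)
    inv[order] = torch.arange(N, device=order.device)
    return out[inv]
```

Because attention runs on fixed-size groups of visible patches rather than on full, variably filled windows, the quadratic cost is paid only over the group size, which is what makes the grouped formulation amenable to the cost-minimizing partition described above.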
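For the third design, a useful reference point is that a submanifold-style sparse convolution over the visible patches produces the same values as an ordinary dense convolution whose masked positions are zeroed, read only at the visible positions. The sketch below (our illustration, not the authors' code; `sparse_conv_reference` is a hypothetical name, and it assumes a stride-1 convolution with "same" padding) demonstrates that equivalence; an actual speed-up requires a dedicated sparse-convolution kernel, e.g., MinkowskiEngine.

```python
import torch
import torch.nn as nn

def sparse_conv_reference(conv: nn.Conv2d, feats, coords, grid_hw):
    """Dense reference for a sparse convolution over visible patches.

    conv:    an ordinary Conv2d (stride 1, 'same' padding assumed).
    feats:   (N, C) features of the visible patches.
    coords:  (N, 2) (row, col) of each visible patch on the token grid.
    grid_hw: (H, W) size of the full token grid.
    """
    H, W = grid_hw
    C = feats.shape[1]
    # Scatter the visible patches onto a zero-initialized dense grid;
    # masked positions stay zero and thus contribute nothing to the sums.
    dense = feats.new_zeros(1, C, H, W)
    dense[0, :, coords[:, 0], coords[:, 1]] = feats.t()
    out = conv(dense)
    # Read outputs only at the visible positions, matching a sparse
    # convolution restricted to the visible sites.
    return out[0, :, coords[:, 0], coords[:, 1]].t()
```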