We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition such that the visible patches within each local window of arbitrary size can be gathered into groups of equal size, and masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm that minimizes the overall computation cost of the attention on the grouped patches. As a result, MIM can now work on hierarchical ViTs in a green and efficient way. For example, we can train hierarchical ViTs about 2.7$\times$ faster and reduce GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superior performance on the downstream COCO object detection benchmark. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
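To make the grouping idea concrete, below is a minimal PyTorch sketch of the masked self-attention performed within one group: visible patches from several local windows are packed into a single equal-size group, and an attention mask blocks interactions across windows. The class name `GroupedMaskedAttention`, its interface, and the single-head formulation are illustrative assumptions for exposition, not the actual GreenMIM implementation.

```python
import torch
import torch.nn as nn

class GroupedMaskedAttention(nn.Module):
    """Sketch: masked self-attention within one group of visible
    patches packed from several local windows (hypothetical API)."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens, window_ids):
        # tokens:     (G, C) visible patches gathered into one group
        # window_ids: (G,)   source-window index of each patch
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = (q @ k.T) * self.scale                      # (G, G) logits
        # Patches may only attend to patches from the same local window.
        same_window = window_ids[:, None] == window_ids[None, :]
        attn = attn.masked_fill(~same_window, float('-inf'))
        return self.proj(attn.softmax(dim=-1) @ v)

# Toy usage: 6 visible patches drawn from two windows, packed as one group.
x = torch.randn(6, 32)
win = torch.tensor([0, 0, 0, 1, 1, 1])
out = GroupedMaskedAttention(32)(x, win)   # -> (6, 32)
```

Because attention is computed per group of size $G$ rather than over all visible patches at once, the cost scales with the sum of $G^2$ over groups; choosing the grouping that minimizes this total is the role of the Dynamic Programming step described above.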