As a de facto solution, vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches, but the globally attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between computational complexity and the size of the attended receptive field. By analyzing the patch interactions of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating that global dependency modeling is redundant in the shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interactions within a sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves performance comparable to existing state-of-the-art models with 70% fewer FLOPs. Our DilateFormer-Base achieves 85.6% top-1 accuracy on the ImageNet-1K classification task, 53.5% box mAP / 46.1% mask mAP on the COCO object detection/instance segmentation tasks, and 51.1% MS mIoU on the ADE20K semantic segmentation task.
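To make the idea of attending only to a local, sparse (dilated) neighborhood concrete, the following is a minimal PyTorch sketch of a multi-scale dilated attention layer. It is not the authors' reference implementation: the layer name, the 1x1-convolution projections, the choice of splitting heads evenly across dilation rates, and the use of `F.unfold` to gather each query's dilated sliding-window neighbors are all assumptions made for illustration.

```python
# Minimal sketch of Multi-Scale Dilated Attention (MSDA), assuming heads are
# split evenly across dilation rates; not the authors' reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDilatedAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window=3, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert num_heads % len(dilations) == 0, "heads are split evenly across dilation rates"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.window = window
        self.dilations = dilations
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)  # per-pixel q, k, v projections
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)     # output projection

    def _dilated_window_attention(self, q, k, v, dilation):
        # q, k, v: (B, h, d, H, W) for the heads assigned to this dilation rate
        B, h, d, H, W = q.shape
        pad = dilation * (self.window - 1) // 2
        # Gather the (window x window) dilated neighborhood of every query position.
        k_unf = F.unfold(k.reshape(B, h * d, H, W), self.window,
                         dilation=dilation, padding=pad)   # (B, h*d*w*w, H*W)
        v_unf = F.unfold(v.reshape(B, h * d, H, W), self.window,
                         dilation=dilation, padding=pad)
        k_unf = k_unf.reshape(B, h, d, self.window * self.window, H * W)
        v_unf = v_unf.reshape(B, h, d, self.window * self.window, H * W)
        q = q.reshape(B, h, d, 1, H * W)
        # Each query attends only to its sparse, local (dilated) neighbors.
        attn = (q * k_unf).sum(dim=2, keepdim=True) * self.scale  # (B, h, 1, w*w, H*W)
        attn = attn.softmax(dim=3)
        out = (attn * v_unf).sum(dim=3)                           # (B, h, d, H*W)
        return out.reshape(B, h * d, H, W)

    def forward(self, x):
        # x: (B, C, H, W) feature map from a low-level stage
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        hpr = self.num_heads // len(self.dilations)  # heads per dilation rate
        outs = []
        for i, r in enumerate(self.dilations):
            sl = slice(i * hpr * self.head_dim, (i + 1) * hpr * self.head_dim)
            qi = q[:, sl].reshape(B, hpr, self.head_dim, H, W)
            ki = k[:, sl].reshape(B, hpr, self.head_dim, H, W)
            vi = v[:, sl].reshape(B, hpr, self.head_dim, H, W)
            outs.append(self._dilated_window_attention(qi, ki, vi, r))
        return self.proj(torch.cat(outs, dim=1))
```

In a DilateFormer-style pyramid, blocks built from such an attention layer (plus the usual feed-forward sub-layers) would occupy the low-level stages, while the high-level stages switch to ordinary global multi-head self-attention, matching the design described in the abstract.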