Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by global self-attention, various methods constrain the range of attention to a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to global self-attention, PS-Attention reduces the computation and memory costs significantly. Meanwhile, it captures richer contextual information under similar computation complexity to previous local self-attention mechanisms. Based on PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M respectively on ImageNet-1K classification at 224×224 resolution, outperforming previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone outperforms the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection & instance segmentation. The code will be released at https://github.com/BR-IDL/PaddleViT.
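To make the idea of attention restricted to a pale-shaped region concrete, here is a minimal sketch (not the official PaddleViT implementation): it restricts standard multi-head self-attention to groups of interlaced rows of the feature map, which corresponds to one half of a pale; the full PS-Attention also covers interlaced columns. The module name, the `rows_per_pale` parameter, and the specific interlacing scheme are illustrative assumptions, not details taken from the abstract.

```python
# Sketch of attention within interlaced rows (one half of a "pale").
# Assumes H is divisible by rows_per_pale and dim by num_heads.
import torch
import torch.nn as nn


class InterlacedRowAttention(nn.Module):
    def __init__(self, dim, num_heads=4, rows_per_pale=7):
        super().__init__()
        self.rows_per_pale = rows_per_pale  # s_r: number of rows in one pale (assumed)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map.
        B, H, W, C = x.shape
        g = H // self.rows_per_pale                  # number of row groups (interlacing stride)
        # Rows spaced g apart end up in the same group -> interlaced rows of a pale.
        x = x.view(B, self.rows_per_pale, g, W, C)   # split H into (s_r, g)
        x = x.permute(0, 2, 1, 3, 4)                 # (B, g, s_r, W, C)
        tokens = x.reshape(B * g, self.rows_per_pale * W, C)
        # Full attention, but only among tokens that share a row group.
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.view(B, g, self.rows_per_pale, W, C).permute(0, 2, 1, 3, 4)
        return out.reshape(B, H, W, C)


# Usage: a 56x56 feature map with 96 channels, 7 interlaced rows per pale.
x = torch.randn(2, 56, 56, 96)
y = InterlacedRowAttention(dim=96, num_heads=4, rows_per_pale=7)(x)  # (2, 56, 56, 96)
```

Because each token attends only to the tokens inside its own pale rather than to the whole feature map, the cost scales with the pale size instead of quadratically with the number of tokens, which is the efficiency argument made in the abstract.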