PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and letting the branches interact. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of the branches are then combined by a pixel-adaptive fusion. The ladder self-attention block is therefore capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Built on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
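The "nearly 1/3" figure is consistent with a simple count of projection weights when the channels are split equally across branches. The sketch below is illustrative only. It assumes three branches, counts only the query, key, value and output linear projections without biases, and ignores the shift and the pixel-adaptive fusion. The channel width of 192 and the helper names are hypothetical and are not taken from the paper or its code.

```python
def projection_params(channels: int) -> int:
    """Weights in the Q, K, V and output projections of one attention layer (no biases)."""
    return 4 * channels * channels


def ladder_projection_params(channels: int, branches: int) -> int:
    """Same count when the channels are split equally across `branches` branches."""
    if channels % branches != 0:
        raise ValueError("channels must be divisible by branches")
    per_branch = channels // branches
    return branches * projection_params(per_branch)


# Hypothetical width of 192 channels, assumed 3 branches.
full = projection_params(192)
ladder = ladder_projection_params(192, 3)
ratio = ladder / full  # each branch is quadratic in C/3, so the total scales as 1/3
print(full, ladder, ratio)
```

Because the projection cost is quadratic in the channel width, splitting C channels into B equal branches scales this part of the cost by 1/B. Any extra modules a real block needs (such as the fusion) add cost on top, which would explain "nearly" rather than exactly 1/3.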


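Related content

Self-attention

A self-attention model uses the attention mechanism to "dynamically" generate the weights of different connections. The attention mechanism mimics the internal process of biological observation: it aligns internal experience with external sensation so that certain regions are observed in finer detail. It can quickly extract important features from sparse data and is therefore widely used in natural language processing tasks, especially machine translation. Self-attention is a refinement of the attention mechanism that relies less on external information and is better at capturing the internal correlations of data or features.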
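As a rough illustration of weights that are generated dynamically from the input itself, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the learned query, key and value projections are replaced by the identity, there is a single head, and the function names and toy tokens are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)  # connection weights depend on the input itself
        # Output is the weighted average of all tokens (used here as values).
        out = [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and changes whenever the input tokens change, which is the sense in which the connection weights are generated dynamically rather than fixed.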
