Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute the global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a lightweight transformer backbone that requires fewer computing resources (e.g. a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism is proposed to enlarge the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and enabling interaction among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the ladder self-attention block (to nearly one-third of the parameters and FLOPs), and the outputs of these branches are then combined by pixel-adaptive fusion. Therefore, the ladder self-attention block is capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
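To make the channel-split, multi-branch design concrete, below is a minimal NumPy sketch of the idea described above: the input feature is split equally along the channel dimension, each branch computes local (windowed) self-attention on its slice with a progressively larger spatial shift, and the branch outputs are merged with a per-pixel gate. The attention here uses identity Q/K/V projections, and the sigmoid gate is only a toy stand-in for the paper's learned pixel-adaptive fusion; all function names and details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def window_attention(x, win):
    """Toy local self-attention within non-overlapping win x win windows.

    x: feature map of shape (H, W, C); identity Q/K/V projections for brevity.
    """
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            patch = x[i:i + win, j:j + win, :].reshape(-1, C)  # (win*win, C)
            scores = patch @ patch.T / np.sqrt(C)
            scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
            attn = np.exp(scores)
            attn /= attn.sum(axis=-1, keepdims=True)
            out[i:i + win, j:j + win, :] = (attn @ patch).reshape(win, win, C)
    return out

def ladder_block(x, num_branches=3, win=4):
    """Sketch of a ladder self-attention block (assumed simplification).

    Splits channels equally among branches; each branch attends within
    local windows at a different spatial shift, so the branches together
    cover differently-positioned windows (the progressive shift idea).
    """
    branches = np.split(x, num_branches, axis=-1)
    outs = []
    for b, xb in enumerate(branches):
        shift = b * win // num_branches          # branch b shifts a bit more
        xb = np.roll(xb, shift=(shift, shift), axis=(0, 1))
        yb = window_attention(xb, win)
        yb = np.roll(yb, shift=(-shift, -shift), axis=(0, 1))  # undo the shift
        outs.append(yb)
    # Toy stand-in for pixel-adaptive fusion: gate each branch's output with
    # a per-pixel sigmoid weight derived from that branch, then concatenate.
    fused = []
    for yb in outs:
        gate = 1.0 / (1.0 + np.exp(-yb.mean(axis=-1, keepdims=True)))
        fused.append(gate * yb)
    return np.concatenate(fused, axis=-1)
```

Because each branch attends over only C/num_branches channels, the per-window attention cost is divided across branches, which mirrors the roughly one-third parameter/FLOP reduction claimed for the three-branch block.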