Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, scaling them down to a mobile-friendly size leads to significant performance degradation. Therefore, developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights often used in vanilla convolutional operators and the token-specific, context-aware weights that appear in attention, and then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in the style of attention. The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. In CloFormer, the combination of AttnConv and vanilla attention, which uses pooling to reduce FLOPs, enables the model to perceive both high-frequency and low-frequency information. Extensive experiments were conducted on image classification, object detection, and semantic segmentation, demonstrating the superiority of CloFormer.
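As a rough illustration of the idea described above, the following PyTorch sketch shows one way an attention-style convolution branch could combine globally shared depthwise-convolution weights with token-specific, context-aware weights generated from the input. This is a minimal sketch under stated assumptions, not the authors' implementation; the module and parameter names (`SimpleAttnConv`, `dim`, `kernel_size`) are hypothetical.

```python
import torch
import torch.nn as nn


class SimpleAttnConv(nn.Module):
    """Sketch of an attention-style convolution (hypothetical, not the paper's code):
    a depthwise conv with globally shared weights aggregates local information,
    and context-aware weights derived from the input modulate the result per token."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Globally shared weights: an ordinary depthwise convolution.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        # Context-aware weights: generated from the input itself, so every
        # spatial position (token) receives its own modulation.
        self.context_weights = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim),
            nn.Tanh(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, height, width)
        v = self.local_agg(x)        # shared-weight local aggregation
        w = self.context_weights(x)  # token-specific, context-aware weights
        return self.proj(w * v)      # enhance local features, then project


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(SimpleAttnConv(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

In the full model, a branch of this kind would sit alongside a pooled (downsampled) vanilla-attention branch, so that the high-frequency local path and the low-frequency global path are fused.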