Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to their lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would actually be unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance with pure transformer blocks and no convolution. Concretely, we introduce a global yet efficient aggregation scheme into both the self-attention and the feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies, and bi-dimensional channel and spatial attentions are imposed over the token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.
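As a rough illustration of the global aggregation scheme described above, the following is a minimal PyTorch sketch, not the authors' implementation: the module name GlobalTokenAttention, the aggregate/broadcast split, and the num_global_tokens parameter are all hypothetical. It only shows how a few learnable global tokens can gather context from all local tokens and broadcast it back without full quadratic attention over the patches.

```python
import torch
import torch.nn as nn

class GlobalTokenAttention(nn.Module):
    """Hypothetical sketch of global-token aggregation (not the official LightViT code).

    A small set of learnable global tokens first gathers context from all local
    (patch) tokens, then broadcasts it back, so every token receives global
    information without attending to all other patches directly.
    """
    def __init__(self, dim, num_heads=4, num_global_tokens=8):
        super().__init__()
        # num_global_tokens is an assumed hyperparameter for this sketch
        self.global_tokens = nn.Parameter(torch.randn(1, num_global_tokens, dim) * 0.02)
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) local token embeddings
        g = self.global_tokens.expand(x.shape[0], -1, -1)
        # aggregate: global tokens attend to all local tokens (cost ~ G * N)
        g, _ = self.aggregate(g, x, x)
        # broadcast: each local token attends to the few global tokens (cost ~ N * G)
        u, _ = self.broadcast(x, g, g)
        return x + u  # residual: local tokens enriched with global context

# quick shape check
blk = GlobalTokenAttention(dim=64)
out = blk(torch.randn(2, 196, 64))
print(out.shape)  # torch.Size([2, 196, 64])
```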