Recent advances in vision transformers (ViTs) have achieved strong performance on visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to reduce the computation cost of transformers, and combine it with efficient mobile CNNs to form XFormer, a novel light-weight CNN-ViT hybrid model that can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN- and ViT-based models across different tasks and datasets. On the ImageNet-1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) with a similar number of parameters. Our model also transfers well to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and 15.3 FPS, surpassing state-of-the-art lightweight segmentation networks.
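The abstract does not spell out how XFA lowers the transformer's computation cost. As a rough illustration only, the sketch below shows one common way such savings are obtained: attending over the feature (channel) axis instead of the token axis, which replaces the O(N^2 d) cost of standard self-attention with O(N d^2). The module name `FeatureAxisAttention`, the `temperature` parameter, and the normalization choices are all assumptions for illustration and are not the paper's exact XFA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAxisAttention(nn.Module):
    """Illustrative attention over the feature (channel) axis.

    Standard self-attention costs O(N^2 * d) for N tokens of width d;
    attending over the d channels instead costs O(N * d^2), which is
    cheaper when N >> d (typical for high-resolution mobile inputs).
    Hypothetical stand-in for XFA, not the paper's exact method.
    """

    def __init__(self, dim: int, temperature: float = 1.0):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens N, channels d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Normalize along the token axis so the d x d map is well scaled.
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        # (d x N) @ (N x d) -> d x d attention over feature channels.
        attn = (q.transpose(-2, -1) @ k) / self.temperature
        attn = attn.softmax(dim=-1)
        # Channel-mixing attention applied to values: (N x d) @ (d x d).
        out = v @ attn
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)       # 2 images, 14x14 tokens, 64 channels
    layer = FeatureAxisAttention(dim=64)
    print(layer(x).shape)             # torch.Size([2, 196, 64])
```

Because the attention map is d x d rather than N x N, the cost grows linearly with the number of tokens, which is the kind of scaling behavior a mobile-friendly transformer block needs; the actual XFA design should be taken from the paper itself.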