Vision transformers have shown excellent performance on computer vision tasks. However, the computational cost of their (local) self-attention mechanism is high. CNNs, by comparison, are more efficient thanks to their built-in inductive biases. Recent works show that CNNs can compete with vision transformers by borrowing their architecture designs and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns of the input image with multiple kernel sizes and enables input-adaptive weights via a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers, while replacing the (local) self-attention mechanism with our proposed MCA. Extensive experimental results demonstrate that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks. For example, ConvFormer-S and ConvFormer-L achieve state-of-the-art top-1 accuracies of 82.8% and 83.6% on the ImageNet dataset. Moreover, ConvFormer-S outperforms Swin-T by 1.5 mIoU on ADE20K and 0.9 bounding-box AP on COCO with a smaller model size. Code and models will be available.
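The abstract describes MCA as combining convolutions of multiple kernel sizes with a gating mechanism that produces input-adaptive weights. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function names (`depthwise_conv2d`, `mca`), the choice of kernel sizes, the use of depthwise convolutions, and the sigmoid gate applied to the aggregated multi-kernel response are all assumptions made for illustration.

```python
import numpy as np

def depthwise_conv2d(x, k):
    """Same-padding depthwise conv: x is (C, H, W), k is (C, kh, kw).

    Hypothetical helper for illustration; a real model would use an
    optimized framework op (e.g. a grouped convolution) instead.
    """
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def mca(x, kernel_sizes=(3, 5, 7), rng=None):
    """Sketch of a multi-kernel attention block (assumed structure).

    Each kernel size captures a different spatial pattern; their
    responses are aggregated, squashed through a sigmoid to form
    input-adaptive gating weights in (0, 1), and used to modulate
    the input features.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    C = x.shape[0]
    agg = np.zeros_like(x)
    for ks in kernel_sizes:
        # Random weights stand in for learned per-channel filters.
        k = rng.normal(scale=1.0 / ks, size=(C, ks, ks))
        agg += depthwise_conv2d(x, k)
    gate = 1.0 / (1.0 + np.exp(-agg))  # input-adaptive weights
    return gate * x                    # gated features, same shape as x
```

The gate depends on the input itself, which is the "dynamic property" the abstract contrasts with static convolutions; the multiple kernel sizes supply the multi-level features.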