Vision transformers have shown excellent performance on computer vision tasks. Because their self-attention mechanism is computationally expensive, recent works have tried to replace it with convolutional operations, which are more efficient and carry a built-in inductive bias. However, these efforts either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a Dynamic Multi-level Attention mechanism (DMA), which captures different patterns of the input image with multiple kernel sizes and enables input-adaptive weights through a gating mechanism. Based on DMA, we present an efficient backbone network named DMFormer. DMFormer adopts the overall architecture of vision transformers while replacing the self-attention mechanism with our proposed DMA. Extensive experimental results on the ImageNet-1K and ADE20K datasets demonstrate that DMFormer achieves state-of-the-art performance, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs).
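To make the core idea concrete, below is a minimal PyTorch-style sketch of a multi-kernel token mixer with input-adaptive gating, in the spirit of what the abstract describes. The module name `MultiLevelGatedConv`, the kernel sizes, and the pooling-plus-softmax gate are illustrative assumptions, not the paper's exact DMA design.

```python
import torch
import torch.nn as nn

class MultiLevelGatedConv(nn.Module):
    """Sketch: depthwise convolutions at several kernel sizes,
    reweighted by an input-dependent gate (assumed form of DMA)."""

    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise convolution per kernel size captures patterns
        # at a different spatial scale (the "multi-level" features).
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        # Gating network (assumed): global average pooling followed by
        # a linear layer yields one weight per branch per input image.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(dim, len(kernel_sizes)),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (batch, dim, height, width)
        weights = self.gate(x)  # (batch, num_branches)
        out = 0
        for i, branch in enumerate(self.branches):
            # Broadcast each scalar gate over channels and spatial dims.
            out = out + weights[:, i].view(-1, 1, 1, 1) * branch(x)
        return out

# Usage: a drop-in token mixer applied to a feature map.
x = torch.randn(2, 64, 56, 56)
y = MultiLevelGatedConv(64)(x)
print(y.shape)  # torch.Size([2, 64, 56, 56])
```

The gate is what makes the mixer dynamic: unlike a fixed multi-branch convolution, the contribution of each kernel size is recomputed for every input.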