While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) Self-attention only captures spatial adaptability and ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN surpasses similar-size vision transformers (ViTs) and convolutional neural networks (CNNs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, and pose estimation. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2 PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T by 4.0% mIoU (50.1 vs. 46.1) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8 vs. 46.2) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. Code is available at https://github.com/Visual-Attention-Network.
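To make the LKA idea concrete, the following is a minimal PyTorch-style sketch of an LKA-like module that decomposes a large-kernel convolution into a depth-wise convolution, a depth-wise dilated convolution, and a point-wise (1x1) convolution, and uses the result as an attention map multiplied element-wise with the input. The specific kernel sizes and dilation below are illustrative assumptions, not a definitive restatement of the released implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn


class LKASketch(nn.Module):
    """Illustrative large-kernel-attention-style block.

    A large receptive field is approximated by stacking:
      1) a depth-wise conv (local structure),
      2) a depth-wise dilated conv (long-range context),
      3) a 1x1 conv (channel mixing, i.e. channel adaptability).
    The output is used as an attention map over the input.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Depth-wise 5x5 conv: captures local spatial structure.
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Depth-wise 7x7 conv with dilation 3: enlarges the receptive field
        # cheaply (linear, not quadratic, in the number of pixels).
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        # Point-wise 1x1 conv: mixes channels, providing channel adaptability.
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dw_conv(x)
        attn = self.dw_dilated(attn)
        attn = self.pw_conv(attn)
        # Element-wise gating: the attention map rescales the input features.
        return x * attn


if __name__ == "__main__":
    # Quick shape check on a dummy feature map (batch 1, 64 channels, 32x32).
    x = torch.randn(1, 64, 32, 32)
    out = LKASketch(64)(x)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Because every operation is a convolution or an element-wise product, the cost grows linearly with the number of pixels, while the stacked depth-wise and dilated depth-wise convolutions emulate a much larger effective kernel, which is how long-range correlations are obtained without quadratic attention.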