Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Although the recently prevailing vision transformers (ViTs) have shown the great potential of self-attention based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs if no extra data are provided. In this work, we try to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We find that a major factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention, which focuses on global dependency modeling at a coarse level, outlook attention efficiently encodes finer-level features and contexts into tokens, which is shown to be critically beneficial to recognition performance but largely ignored by self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, making it the first model to exceed 87% accuracy on this competitive benchmark without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks such as semantic segmentation, achieving 84.3% mIoU on the Cityscapes validation set and 54.3% mIoU on the ADE20K validation set. Code is available at \url{https://github.com/sail-sg/volo}.
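To make the idea of outlook attention concrete, below is a minimal, hedged PyTorch sketch of the mechanism described above: each anchor token directly predicts dense attention weights over its local k x k window (no query-key dot product), and the weighted neighborhood values are folded back onto the feature map. It is an illustrative simplification, not the official implementation; refer to the released code at \url{https://github.com/sail-sg/volo} for the exact module, and treat the hyper-parameter names (dim, num_heads, kernel_size, stride) as assumptions for this sketch.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Minimal sketch of outlook attention (illustrative, not the official code)."""
    def __init__(self, dim, num_heads=6, kernel_size=3, stride=1):
        super().__init__()
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.stride = stride
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Values come from a plain linear projection.
        self.v = nn.Linear(dim, dim, bias=False)
        # Attention weights are generated directly from each anchor token:
        # one (k*k) x (k*k) map per head, with no query-key interaction.
        self.attn = nn.Linear(dim, kernel_size ** 4 * num_heads)
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2, stride=stride)
        self.pool = nn.AvgPool2d(stride, stride, ceil_mode=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        k = self.kernel_size
        h, w = -(-H // self.stride), -(-W // self.stride)   # ceil division
        # Gather the k x k neighborhood values around every anchor position.
        v = self.v(x).permute(0, 3, 1, 2)                   # (B, C, H, W)
        v = self.unfold(v).reshape(B, self.num_heads, self.head_dim, k * k, h * w)
        v = v.permute(0, 1, 4, 3, 2)                        # (B, heads, h*w, k*k, head_dim)
        # Predict dense attention over each local window from the anchor token itself.
        a = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        a = self.attn(a).reshape(B, h * w, self.num_heads, k * k, k * k)
        a = (a.permute(0, 2, 1, 3, 4) * self.scale).softmax(dim=-1)
        # Weighted aggregation, then fold overlapping windows back to (H, W).
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * k * k, h * w)
        out = F.fold(out, (H, W), k, padding=k // 2, stride=self.stride)
        return self.proj(out.permute(0, 2, 3, 1))           # (B, H, W, C)

# Illustrative usage: fine-level tokens at 28x28 resolution with 192 channels.
tokens = torch.randn(2, 28, 28, 192)
out = OutlookAttention(dim=192, num_heads=6)(tokens)        # -> (2, 28, 28, 192)
\end{verbatim}

Because the attention weights are produced by a single linear layer per token rather than pairwise dot products, the cost grows with the window size rather than the sequence length, which is what allows outlook attention to operate on fine-level (high-resolution) token maps.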