Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification, their performance is still inferior to the latest SOTA CNNs if no extra data are provided. In this work, we aim to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We found that the main factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention, which focuses on global dependency modeling at a coarse level, the outlook attention aims to efficiently encode finer-level features and contexts into tokens, which are shown to be critical for recognition performance but largely ignored by self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, being the first model exceeding 87% accuracy on this competitive benchmark without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks, such as semantic segmentation. We achieve 84.3% mIoU score on the Cityscapes validation set and 54.3% on the ADE20K validation set. Code is available at https://github.com/sail-sg/volo.
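The abstract describes outlook attention only conceptually, so the following is a minimal PyTorch sketch of the idea as we read it: each token directly predicts, via a linear layer, the attention weights over its local K x K window (no query-key dot products), and the weighted windows are overlap-added back onto the token map. The class name, the stride-1 simplification, and the omission of dropout and bias terms are our own assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Sketch of outlook attention (stride 1, no dropout) -- an assumption-laden
    simplification, not the authors' released code."""

    def __init__(self, dim, num_heads=2, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kernel_size = kernel_size
        self.scale = self.head_dim ** -0.5
        self.v = nn.Linear(dim, dim, bias=False)
        # Each token predicts a (K*K) x (K*K) attention map per head.
        self.attn = nn.Linear(dim, num_heads * kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (B, H, W, C) token map
        B, H, W, C = x.shape
        K, N = self.kernel_size, H * W
        # Gather each location's K x K neighborhood of value vectors.
        v = self.v(x).permute(0, 3, 1, 2)                    # (B, C, H, W)
        v = self.unfold(v).reshape(
            B, self.num_heads, self.head_dim, K * K, N
        ).permute(0, 1, 4, 3, 2)                             # (B, h, N, K*K, d)
        # Attention weights come from a linear projection, not Q.K products.
        a = self.attn(x).reshape(B, N, self.num_heads, K * K, K * K)
        a = a.permute(0, 2, 1, 3, 4) * self.scale            # (B, h, N, K*K, K*K)
        a = a.softmax(dim=-1)
        # Weighted aggregation inside each local window.
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * K * K, N)
        # Overlap-add the windows back onto the H x W token map.
        out = F.fold(out, (H, W), K, padding=K // 2)
        return self.proj(out.permute(0, 2, 3, 1))            # (B, H, W, C)
```

Because the weights are generated directly from each token rather than from query-key dot products, the cost per window stays low, which is what makes it feasible to run this attention densely on fine-grained token maps; a call such as `OutlookAttention(384, num_heads=2)(torch.randn(1, 28, 28, 384))` returns a tensor of the same shape as its input.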