The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.