Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and a performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we find that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M-parameter Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
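To make the role of the GRN layer more concrete, the following is a minimal PyTorch sketch of a global-response-normalization-style layer in the spirit described above: it aggregates a per-channel response over the spatial dimensions, normalizes that response across channels (the source of inter-channel feature competition), and recalibrates the input. The channels-last layout, the L2-based spatial aggregation, the `eps` constant, and the zero-initialized `gamma`/`beta` parameters are illustrative assumptions rather than a definitive reproduction of the ConvNeXt V2 implementation.

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Sketch of a Global Response Normalization layer.

    Assumes channels-last inputs of shape (N, H, W, C), as used in
    ConvNeXt-style blocks; details may differ from the official code.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        # Learnable affine parameters, zero-initialized so the layer starts
        # as an identity mapping thanks to the residual term in forward().
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global aggregation: one L2 response per channel over H and W.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)     # (N, 1, 1, C)
        # Divisive normalization across channels: each channel's response is
        # scaled relative to the mean response, so channels compete.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)  # (N, 1, 1, C)
        # Feature calibration plus a residual connection.
        return self.gamma * (x * nx) + self.beta + x


# Usage example on a hypothetical channels-last feature map.
x = torch.randn(2, 14, 14, 768)
y = GRN(768)(x)
print(y.shape)  # torch.Size([2, 14, 14, 768])
```

In this sketch the layer would be dropped into each block of the backbone; because `gamma` and `beta` start at zero, training begins from the unmodified features and learns how strongly to apply the cross-channel recalibration.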