Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than that of convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise convolution and self-attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective at improving generalization, capacity, and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; when pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
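To make insight (1) concrete: a depthwise convolution kernel depends only on the relative position i - j, so adding a learnable relative-position bias w_{i-j} to the attention logits before the softmax lets a single operation behave like both a convolution and self-attention. Below is a minimal 1D, single-head sketch of that idea (illustrative names such as `relative_attention_1d` are our own, not the authors' code; the paper applies the same bias over 2D feature maps with a global receptive field):

```python
# Sketch of "relative attention": a conv-style bias w[i - j], indexed only by
# relative position, is added to the dot-product attention logits before softmax.
import numpy as np

def relative_attention_1d(x, w):
    """x: (seq_len, dim) token features; w: (2*seq_len - 1,) relative-position biases."""
    seq_len, _ = x.shape
    logits = x @ x.T                                   # pairwise similarities x_i^T x_j
    idx = np.arange(seq_len)
    rel = idx[:, None] - idx[None, :] + seq_len - 1    # map (i - j) to an index into w
    logits = logits + w[rel]                           # translation-invariant, conv-like bias
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)         # softmax over j
    return probs @ x                                   # attention-weighted sum of values

# Example: 8 tokens of dimension 4 with a zero-initialized bias kernel.
x = np.random.randn(8, 4)
w = np.zeros(2 * 8 - 1)
print(relative_attention_1d(x, w).shape)  # (8, 4)
```

With w fixed and the similarity term removed, the operation reduces to a (softmax-normalized) depthwise convolution; with w set to zero, it reduces to plain self-attention, which is the sense in which the two are unified.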