Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity, and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets. For example, CoAtNet achieves 86.0% ImageNet top-1 accuracy without extra data, and 89.77% with extra JFT data, outperforming prior arts of both convolutional networks and Transformers. Notably, when pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT while using 23x less data.
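To make insight (1) concrete, below is a minimal, illustrative sketch of the pre-softmax relative attention that unifies a depthwise-convolution-style position term with self-attention, i.e. y_i = sum_j softmax_j(x_i · x_j + w_{i-j}) x_j. It assumes a 1-D token sequence and a scalar bias per relative offset for simplicity; the function and parameter names (`relative_attention`, `w_rel`) are placeholders for this sketch, not the paper's implementation.

```python
import numpy as np

def relative_attention(x, w_rel):
    """Sketch of relative attention on a 1-D sequence.

    x:     (L, d) array of input tokens
    w_rel: (2L - 1,) learned scalar bias indexed by relative offset i - j
           (the convolution-like, input-independent term)
    Returns the (L, d) array y with y_i = sum_j softmax_j(x_i . x_j + w_{i-j}) x_j.
    """
    L, d = x.shape
    logits = x @ x.T                                          # input-dependent attention logits
    offsets = np.arange(L)[:, None] - np.arange(L)[None, :]   # relative offsets i - j
    logits = logits + w_rel[offsets + L - 1]                  # add static relative-position bias
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)            # softmax over j
    return attn @ x                                           # attention-weighted sum of values

# Toy usage: with an all-zero bias this reduces to plain (unscaled) self-attention,
# while the bias alone plays the role of a depthwise-convolution kernel.
x = np.random.randn(8, 16).astype(np.float32)
w = np.zeros(2 * 8 - 1, dtype=np.float32)
y = relative_attention(x, w)
print(y.shape)  # (8, 16)
```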