Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
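For concreteness, the architecture described above could be sketched roughly as follows in PyTorch. This is a minimal, illustrative sketch based only on the description in this abstract (a strided convolution for the patch embedding, a depthwise convolution for spatial mixing, and a pointwise convolution for channel mixing); the hyperparameter names and default values here are assumptions, and the authoritative implementation is the one in the repository linked above.

```python
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a module with a residual (skip) connection."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer_sketch(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    """Illustrative ConvMixer-style model; names and defaults are assumptions."""
    return nn.Sequential(
        # Patch embedding: a strided convolution that groups each
        # patch_size x patch_size region into a single dim-channel feature.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Repeated mixing blocks that keep the same size and resolution.
        *[nn.Sequential(
            # Spatial mixing: depthwise convolution (groups=dim) mixes
            # locations within each channel, with a residual connection.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Channel mixing: pointwise (1x1) convolution mixes across channels.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        # Classification head: global average pooling followed by a linear layer.
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

The split between the depthwise and pointwise convolutions mirrors the separation of spatial and channel mixing described above, and because every block maps a dim-channel feature map to another of the same shape, the network maintains equal size and resolution throughout.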