Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as the Vision Transformer and MLP-Mixer, have begun to set new trends by showing promising results on the ImageNet classification task. In this paper, we conduct empirical studies of these DNN structures to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH that adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale; however, they exhibit distinctive behaviors as the network size scales up. Based on our findings, we propose two hybrid models built from convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs, which is already on par with SOTA models of sophisticated design. The code and models will be made publicly available.
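The core design idea named in the abstract, separating spatial processing from channel processing, can be sketched in a few lines. The following is a minimal illustrative example (not the authors' actual SPACH implementation): an MLP-Mixer-style block where one learned matrix mixes information across spatial token positions and another mixes across channels, each with a residual connection. The function names and the omission of normalization are simplifications for clarity.

```python
import numpy as np

def spatial_mix(x, w_spatial):
    # Mix information across the token (spatial) dimension.
    # x: (num_tokens, channels), w_spatial: (num_tokens, num_tokens)
    return w_spatial @ x

def channel_mix(x, w_channel):
    # Mix information across channels, independently per spatial position.
    # w_channel: (channels, channels)
    return x @ w_channel

def spach_style_block(x, w_spatial, w_channel):
    # One block: spatial module, then channel module, each with a
    # residual connection (layer norm and nonlinearity omitted for brevity).
    x = x + spatial_mix(x, w_spatial)
    x = x + channel_mix(x, w_channel)
    return x

rng = np.random.default_rng(0)
tokens, channels = 16, 8
x = rng.standard_normal((tokens, channels))
w_s = 0.1 * rng.standard_normal((tokens, tokens))
w_c = 0.1 * rng.standard_normal((channels, channels))
y = spach_style_block(x, w_s, w_c)
print(y.shape)  # (16, 8): the block preserves the token/channel layout
```

Swapping the spatial-mixing module (a fixed MLP here) for self-attention or a convolution, while keeping the channel MLP fixed, is exactly the kind of controlled comparison a unified framework like SPACH enables.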