Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
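To make the two mixing steps concrete, the sketch below shows one Mixer layer in JAX/Flax (the framework of the paper's reference implementation). It is a minimal illustration under stated assumptions: the module and parameter names (`MlpBlock`, `tokens_mlp_dim`, `channels_mlp_dim`) are chosen for readability and are not quoted from the paper's code.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class MlpBlock(nn.Module):
    """Two-layer MLP with a GELU nonlinearity."""
    hidden_dim: int

    @nn.compact
    def __call__(self, x):
        out_dim = x.shape[-1]
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.gelu(x)
        return nn.Dense(out_dim)(x)


class MixerBlock(nn.Module):
    """One Mixer layer: token mixing across patches, then channel mixing per patch."""
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):                  # x: (batch, patches, channels)
        # Token mixing: transpose so the MLP runs across the patch dimension,
        # shared over channels ("mixing" spatial information).
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)           # (batch, channels, patches)
        y = MlpBlock(self.tokens_mlp_dim, name="token_mixing")(y)
        y = jnp.swapaxes(y, 1, 2)           # back to (batch, patches, channels)
        x = x + y                           # skip connection
        # Channel mixing: the MLP runs over channels independently at each
        # patch location ("mixing" the per-location features).
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name="channel_mixing")(y)


# Usage with illustrative shapes (e.g. 196 patches of a 224x224 image at patch size 16).
x = jnp.zeros((1, 196, 512))
block = MixerBlock(tokens_mlp_dim=256, channels_mlp_dim=2048)
params = block.init(jax.random.PRNGKey(0), x)
y = block.apply(params, x)                  # output shape equals input shape
```

Note the design choice visible in the sketch: both mixing steps are wrapped in layer normalization and skip connections, and neither relies on convolutions or self-attention.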