For the past ten years, CNNs have reigned supreme in computer vision, but recently, Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. The first is to divide the token-mixing block vertically and horizontally. The second is to make spatial correlations denser among some channels of token mixing. With this approach, we improved the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. Our small model, RaftMLP-S, is comparable to state-of-the-art global MLP-based models in terms of parameters and efficiency per computation. In addition, we addressed the problem of fixed input image resolution in global MLP-based models by utilizing bicubic interpolation. We demonstrated that these models can be applied as the backbone of architectures for downstream tasks such as object detection. However, the performance gain was not significant, suggesting that global MLP-based models need MLP-specific architectures for downstream tasks. The source code in PyTorch is available at \url{https://github.com/okojoalg/raft-mlp}.
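As a rough illustration of the two techniques mentioned above, the following PyTorch sketch factorizes MLP-Mixer's token-mixing step into a vertical pass and a horizontal pass, shrinking the token-mixing weights from roughly $(hw)^2$ to $h^2 + w^2$ entries, and shows how bicubic interpolation can resample a trained mixing matrix for a different token count. The names \texttt{SeparableTokenMixing} and \texttt{resize\_mixing\_weight} and the exact layer arrangement are illustrative assumptions, not the paper's implementation; see the repository above for the actual code.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableTokenMixing(nn.Module):
    # Illustrative sketch, not the authors' code: mix tokens along the
    # vertical axis, then the horizontal axis, instead of over all
    # h * w tokens at once as in the original MLP-Mixer.
    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        self.h, self.w = h, w
        self.norm_v = nn.LayerNorm(dim)
        self.mix_v = nn.Linear(h, h)  # mixes tokens within each column
        self.norm_h = nn.LayerNorm(dim)
        self.mix_h = nn.Linear(w, w)  # mixes tokens within each row

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h * w, dim), the token layout used by MLP-Mixer
        b, n, d = x.shape
        x = x.reshape(b, self.h, self.w, d)
        # vertical mixing: linear layer applied across the h axis
        y = self.norm_v(x).permute(0, 3, 2, 1)     # (b, d, w, h)
        x = x + self.mix_v(y).permute(0, 3, 2, 1)  # back to (b, h, w, d)
        # horizontal mixing: linear layer applied across the w axis
        y = self.norm_h(x).permute(0, 1, 3, 2)     # (b, h, d, w)
        x = x + self.mix_h(y).permute(0, 1, 3, 2)  # back to (b, h, w, d)
        return x.reshape(b, n, d)

def resize_mixing_weight(weight: torch.Tensor, new_len: int) -> torch.Tensor:
    # Hypothetical helper: bicubically resample a trained token-mixing
    # matrix so a model trained at one resolution can accept a
    # different number of tokens at inference time.
    w = weight[None, None]  # (1, 1, out_len, in_len) for F.interpolate
    w = F.interpolate(w, size=(new_len, new_len), mode="bicubic",
                      align_corners=False)
    return w[0, 0]
\end{verbatim}

For example, a $224 \times 224$ image split into $16 \times 16$ patches yields a $14 \times 14$ token grid, so \texttt{SeparableTokenMixing(h=14, w=14, dim=256)} would mix those 196 tokens with two $14 \times 14$ weight matrices rather than one $196 \times 196$ matrix.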