For the past ten years, CNNs have reigned supreme in the world of computer vision, but recently, the Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practice. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of building a non-convolutional inductive bias into the architecture itself, and we do so with two simple ideas. One is to divide the token-mixing block vertically and horizontally. The other is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at \url{https://github.com/okojoalg/raft-mlp}.
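To make the first idea concrete, below is a minimal PyTorch sketch of splitting token mixing into separate vertical and horizontal passes over an h x w token grid. The module name `VerticalHorizontalMixing`, the expansion factor, and the normalization and residual placement are illustrative assumptions, not the authors' implementation; see the linked repository for the actual RaftMLP code.

```python
import torch
import torch.nn as nn


class VerticalHorizontalMixing(nn.Module):
    """Minimal sketch of vertical/horizontal token mixing on an h x w token grid.

    Instead of one MLP over all h*w tokens (as in MLP-Mixer), tokens are mixed
    along columns (vertical pass) and then rows (horizontal pass), building a
    weak spatial inductive bias into the block. Hypothetical module, not the
    RaftMLP implementation.
    """

    def __init__(self, h: int, w: int, dim: int, expansion: int = 2):
        super().__init__()
        self.h, self.w = h, w
        self.norm_v = nn.LayerNorm(dim)
        self.mix_v = nn.Sequential(
            nn.Linear(h, h * expansion), nn.GELU(), nn.Linear(h * expansion, h)
        )
        self.norm_h = nn.LayerNorm(dim)
        self.mix_h = nn.Sequential(
            nn.Linear(w, w * expansion), nn.GELU(), nn.Linear(w * expansion, w)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h*w, dim) token sequence laid out on an h x w grid
        b, n, d = x.shape
        # vertical pass: apply an MLP along the column (height) axis
        y = self.norm_v(x).reshape(b, self.h, self.w, d)
        y = self.mix_v(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # mix over h
        x = x + y.reshape(b, n, d)
        # horizontal pass: apply an MLP along the row (width) axis
        y = self.norm_h(x).reshape(b, self.h, self.w, d)
        y = self.mix_h(y.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # mix over w
        x = x + y.reshape(b, n, d)
        return x


# Usage: for a 14 x 14 grid of 128-dimensional tokens, the block maps a
# (batch, 196, 128) tensor to the same shape.
x = torch.randn(2, 14 * 14, 128)
block = VerticalHorizontalMixing(h=14, w=14, dim=128)
print(block(x).shape)  # torch.Size([2, 196, 128])
```

Because each pass mixes only h or w tokens rather than all h*w at once, the token-mixing weights shrink from O((hw)^2) to O(h^2 + w^2) parameters, which is one way such a split can reduce parameters and computation relative to MLP-Mixer.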