The multilayer perceptron (MLP), the first neural network architecture to appear, was an early success. But constrained by hardware computing power and the size of available datasets, it lay dormant for decades. During this period, we witnessed a paradigm shift from manual feature extraction to CNNs with local receptive fields, and further to Transformers with global receptive fields based on the self-attention mechanism. This year (2021), with the introduction of MLP-Mixer, MLP has re-entered the limelight and attracted extensive research from the computer vision community. Compared to the conventional MLP, it is deeper and changes the input from full flattening to patch flattening. Given its high performance and lesser need for vision-specific inductive bias, the community cannot help but wonder: will MLP, the simplest structure with global receptive fields but no attention, become a new computer vision paradigm? To answer this question, this survey aims to provide a comprehensive overview of the recent development of deep vision MLP models. Specifically, we review these vision MLPs in detail, from subtle sub-module design to overall network structure. We compare the receptive fields, computational complexity, and other properties of different network designs in order to give a clear picture of the development path of MLPs. The investigation shows that MLPs' resolution sensitivity and computational density remain unresolved, and that pure MLPs are gradually evolving toward CNN-like designs. We suggest that current data volumes and computational power are not yet ready to embrace pure MLPs, and that artificial visual guidance remains important. Finally, we provide an analysis of open research directions and possible future work. We hope this effort will ignite further interest in the community and encourage better vision-tailored designs for today's neural networks.
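The shift from full flattening to patch flattening mentioned above can be sketched as follows. This is a minimal illustration, assuming an MLP-Mixer-style 224×224 RGB input split into 16×16 patches; the array names and sizes are illustrative, not taken from the survey itself.

```python
import numpy as np

# Illustrative sizes: a 224x224 RGB image split into 16x16 patches (MLP-Mixer style).
H = W = 224
P = 16               # patch size
C = 3                # channels
image = np.random.rand(H, W, C)

# Full flattening (classic MLP input): one long vector of H*W*C features.
full_vector = image.reshape(-1)
assert full_vector.shape == (H * W * C,)          # (150528,)

# Patch flattening: a sequence of (H/P)*(W/P) tokens,
# each a flattened P x P x C patch fed to per-patch and cross-patch MLPs.
patches = image.reshape(H // P, P, W // P, P, C)  # split rows and cols into patches
patches = patches.transpose(0, 2, 1, 3, 4)        # bring the patch-grid indices first
tokens = patches.reshape((H // P) * (W // P), P * P * C)
assert tokens.shape == (196, 768)                 # 14*14 tokens, each of dim 16*16*3
```

Keeping a token per patch, rather than one giant vector, is what lets deep MLPs mix information across spatial locations and channels separately.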