Unlike traditional convolutional neural networks (CNNs) and vision transformers, the vision multilayer perceptron (MLP) is a new kind of vision model with an extremely simple architecture built solely by stacking fully connected layers. The input image of a vision MLP is usually split into multiple tokens (patches), but existing MLP models aggregate these tokens directly with fixed weights, neglecting the varying semantic information of tokens from different images. To aggregate tokens dynamically, we propose to represent each token as a wave function with two parts, amplitude and phase. The amplitude is the original feature, and the phase term is a complex value that changes according to the semantic content of the input image. Introducing the phase term dynamically modulates the relationship between tokens and the fixed weights in the MLP. Based on this wave-like token representation, we establish a novel Wave-MLP architecture for vision tasks. Extensive experiments demonstrate that the proposed Wave-MLP is superior to state-of-the-art MLP architectures on various vision tasks such as image classification, object detection, and semantic segmentation.
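The effect of the phase term on fixed-weight aggregation can be illustrated with a minimal sketch. It is not the Wave-MLP implementation: the helper names `wave_token` and `aggregate` are hypothetical, the exponential form amplitude * exp(i * phase) is an assumed way of writing a wave with amplitude and phase, and the phases are set by hand here, whereas in the proposed model they would depend on the input image.

```python
import cmath
import math


def wave_token(amplitude, phase):
    """Hypothetical helper: per-channel complex value amplitude * exp(i * phase)."""
    return [a * cmath.exp(1j * p) for a, p in zip(amplitude, phase)]


def aggregate(tokens, weights):
    """Hypothetical helper: weighted sum over tokens with one fixed real weight per token."""
    channels = len(tokens[0])
    return [
        sum(w * tok[c] for w, tok in zip(weights, tokens))
        for c in range(channels)
    ]


# Two tokens with identical amplitudes (two channels each) and fixed weights.
amplitudes = [[1.0, 2.0], [1.0, 2.0]]
weights = [0.5, 0.5]

# Case 1: both tokens share the same phase, so they reinforce each other.
in_phase = [wave_token(a, [0.0, 0.0]) for a in amplitudes]

# Case 2: the second token is shifted by pi, so the tokens cancel.
opposed = [
    wave_token(amplitudes[0], [0.0, 0.0]),
    wave_token(amplitudes[1], [math.pi, math.pi]),
]

out_in_phase = aggregate(in_phase, weights)
out_opposed = aggregate(opposed, weights)

print([round(abs(v), 6) for v in out_in_phase])
print([round(abs(v), 6) for v in out_opposed])
```

With the weights held fixed, the aggregated magnitude changes only because the phases differ, which is the sense in which a phase term can modulate how tokens combine.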