Previous vision MLPs such as MLP-Mixer and ResMLP accept linearly flattened image patches as input, which makes them inflexible to different input sizes and makes it hard for them to capture spatial information. This design keeps MLPs from reaching performance comparable to their transformer-based counterparts and prevents them from becoming a general backbone for computer vision. This paper presents Hire-MLP, a simple yet competitive vision MLP architecture built on \textbf{Hi}erarchical \textbf{re}arrangement, which contains two levels of rearrangement. Specifically, the inner-region rearrangement captures local information inside a spatial region, and the cross-region rearrangement enables communication between different regions and captures global context by circularly shifting all tokens along spatial directions. Extensive experiments demonstrate the effectiveness of Hire-MLP as a versatile backbone for various vision tasks. In particular, Hire-MLP achieves competitive results on image classification, object detection, and semantic segmentation, e.g., 83.8% top-1 accuracy on ImageNet, 51.7% box AP and 44.8% mask AP on COCO val2017, and 49.9% mIoU on ADE20K, surpassing previous transformer-based and MLP-based models with a better trade-off between accuracy and throughput. Code is available at https://github.com/ggjy/Hire-Wave-MLP.pytorch.
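The two rearrangements described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the (H, W, C) feature layout, the function names, the region size, the shift step, and the choice to stack the channels of the tokens inside a region are all assumptions made for illustration, and the channel-mixing MLP that would act on the rearranged features is left out.

```python
import numpy as np

def cross_region_shift(x, step, axis=0):
    """Circularly shift all tokens by `step` positions along one spatial axis."""
    return np.roll(x, shift=step, axis=axis)

def inner_region_rearrange(x, region):
    """Group `region` adjacent tokens along height and stack their channels.

    x: (H, W, C) -> (H // region, W, region * C)
    """
    H, W, C = x.shape
    assert H % region == 0, "height must be divisible by the region size"
    x = x.reshape(H // region, region, W, C)  # split height into regions
    x = x.transpose(0, 2, 1, 3)               # (H // region, W, region, C)
    return x.reshape(H // region, W, region * C)

def inner_region_restore(y, region):
    """Inverse of inner_region_rearrange: (G, W, region * C) -> (G * region, W, C)."""
    G, W, RC = y.shape
    C = RC // region
    y = y.reshape(G, W, region, C).transpose(0, 2, 1, 3)
    return y.reshape(G * region, W, C)

# Toy feature map: 4 x 2 tokens with 3 channels each (hypothetical sizes).
x = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)

shifted = cross_region_shift(x, step=1)              # tokens cross region borders
grouped = inner_region_rearrange(shifted, region=2)  # local tokens share one vector
# A channel-mixing MLP would be applied to the last axis of `grouped` here.
restored = cross_region_shift(inner_region_restore(grouped, region=2), step=-1)
```

Because the shift is circular and the rearrangement is a pure reshape, undoing both steps recovers the original token layout, so the only learned mixing in such a block would come from the MLP applied in between.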