Recently proposed deep MLP models have stirred considerable interest in the vision community. Historically, the availability of larger datasets combined with increased computing capacity has led to paradigm shifts. This review provides a detailed discussion of whether MLP can become a new paradigm for computer vision. We compare in detail the intrinsic connections and differences among convolution, the self-attention mechanism, and token-mixing MLP. We then present the advantages and limitations of token-mixing MLP, followed by a careful analysis of recent MLP-like variants, from module design to network architecture, and of their applications. In the GPU era, locally and globally weighted summations, represented by convolution, the self-attention mechanism, and MLP, are the current mainstream. We suggest that the further development of paradigms be considered alongside next-generation computing devices.
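To make the "weighted summation" comparison concrete, all three operators can be viewed as computing a weighted sum over input tokens; the following is a minimal sketch of this unifying view (the notation is ours, introduced for illustration):

Given input tokens $x_1, \dots, x_N \in \mathbb{R}^d$, each operator produces
$$y_i = \sum_{j=1}^{N} w_{ij}\, x_j,$$
differing only in how the weights $w_{ij}$ are obtained:
- Convolution: $w_{ij} \neq 0$ only for $j$ in a local window around $i$, and the weights are learned, static, and shared across positions (local, static).
- Self-attention: $w_{ij} = \operatorname{softmax}_j\!\big(q_i^\top k_j / \sqrt{d}\big)$ with $q_i = W_q x_i$ and $k_j = W_k x_j$, so the weights are global and input-dependent (global, dynamic).
- Token-mixing MLP: $w_{ij}$ is the $(i,j)$ entry of a learned matrix $W \in \mathbb{R}^{N \times N}$ applied along the token dimension, so the weights are global but fixed after training (global, static).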