This paper studies how to keep a vision backbone effective while removing token mixers from its basic building blocks. Token mixers, such as self-attention in vision transformers (ViTs), are intended to perform information communication between different spatial tokens, but they incur considerable computational cost and latency. However, directly removing them leaves the model with an incomplete structural prior and thus causes a significant accuracy drop. To this end, we first develop RepIdentityFormer, based on the re-parameterizing idea, to study the token-mixer-free model architecture. We then explore an improved learning paradigm to break the limitations of the simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. Project page: https://techmonsterwang.github.io/RIFormer/.
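The re-parameterizing idea mentioned above can be illustrated with a minimal sketch: during training, the token-mixer slot is replaced by a cheap per-channel affine transform after LayerNorm, and at inference that affine is folded into the LayerNorm parameters, leaving a block with no token mixer at all. The code below is an assumption-laden toy (plain Python, illustrative variable names; the paper's exact formulation may differ), not the authors' implementation:

```python
import math

def layernorm(x, weight, bias, eps=1e-6):
    # Normalize a channel vector, then apply a per-channel affine.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) * w + b
            for v, w, b in zip(x, weight, bias)]

# Training-time branch: LayerNorm followed by an extra per-channel affine
# standing in where the token mixer used to be (values are illustrative).
x = [0.5, -1.2, 3.0, 0.1]
ln_w, ln_b = [1.0, 0.9, 1.1, 1.0], [0.0, 0.1, -0.1, 0.2]
aff_w, aff_b = [0.8, 1.2, 1.0, 0.5], [0.05, -0.2, 0.0, 0.3]

y = layernorm(x, ln_w, ln_b)
y_train = [aw * v + ab for aw, v, ab in zip(aff_w, y, aff_b)]

# Inference-time: fold the affine into the LayerNorm parameters, so the
# block reduces to a single LayerNorm with the token mixer removed.
merged_w = [aw * lw for aw, lw in zip(aff_w, ln_w)]
merged_b = [aw * lb + ab for aw, lb, ab in zip(aff_w, ln_b, aff_b)]
y_infer = layernorm(x, merged_w, merged_b)

# Both paths produce identical outputs, so the merge is exact.
assert all(abs(a - b) < 1e-9 for a, b in zip(y_train, y_infer))
```

The fold is exact because an affine of an affine is again affine: `a * (g * x_hat + b0) + b1 = (a * g) * x_hat + (a * b0 + b1)`, so no accuracy is lost when the training-time branch is collapsed for deployment.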