We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet for local processing and of the transformer for global interaction, while the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g., six or fewer) that are randomly initialized to learn global priors, resulting in low computational cost. Combined with the proposed lightweight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has strong representation power. It outperforms MobileNetV3 in the low-FLOP regime, from 25M to 500M FLOPs, on ImageNet classification. For instance, Mobile-Former achieves 77.9\% top-1 accuracy at 294M FLOPs, gaining 1.3\% over MobileNetV3 while saving 17\% of computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in the RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing the backbone, encoder, and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP while saving 52\% of computational cost and 36\% of parameters.
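To make the architecture concrete, below is a minimal sketch of one Mobile-Former block in PyTorch. This is an illustration under our reading of the abstract, not the authors' implementation: it omits details such as dynamic ReLU and the exact projection layout of the paper's lightweight cross attention, and all names (MobileFormerBlock, proj_in, proj_out, cross_attn) are illustrative.

```python
import torch
import torch.nn as nn


class MobileFormerBlock(nn.Module):
    """Sketch of one parallel block: a MobileNet-style inverted
    bottleneck (local) runs alongside a tiny transformer over a few
    learnable global tokens, coupled by a two-way cross-attention bridge."""

    def __init__(self, channels: int, token_dim: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        # Mobile side: pointwise expand -> depthwise 3x3 -> pointwise project.
        self.mobile = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Former side: standard self-attention, but only over the few tokens.
        self.former = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=2, dim_feedforward=2 * token_dim,
            batch_first=True)
        # Bridge projections between feature channels and token dimension.
        self.proj_in = nn.Linear(channels, token_dim)   # Mobile -> Former
        self.proj_out = nn.Linear(token_dim, channels)  # Former -> Mobile

    @staticmethod
    def cross_attn(q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # Simplified single-head cross attention: scaled dot-product with
        # keys and values shared (no separate K/V projections).
        attn = torch.softmax(
            q @ kv.transpose(-2, -1) / kv.shape[-1] ** 0.5, dim=-1)
        return attn @ kv

    def forward(self, x: torch.Tensor, tokens: torch.Tensor):
        b, c, h, w = x.shape
        feat = x.flatten(2).transpose(1, 2)              # (B, HW, C)
        # Mobile -> Former: tokens gather global context from local features.
        tokens = tokens + self.cross_attn(tokens, self.proj_in(feat))
        tokens = self.former(tokens)
        # Local processing on the feature map (residual).
        x = x + self.mobile(x)
        # Former -> Mobile: pixels attend back to the global tokens.
        feat = x.flatten(2).transpose(1, 2)
        feat = feat + self.cross_attn(feat, self.proj_out(tokens))
        return feat.transpose(1, 2).reshape(b, c, h, w), tokens
```

A hypothetical usage, with both streams updated and passed on to the next block:

```python
block = MobileFormerBlock(channels=32, token_dim=128)
x = torch.randn(2, 32, 56, 56)    # local feature map
tokens = torch.randn(2, 6, 128)   # six randomly initialized global tokens
x, tokens = block(x, tokens)
```

Even in this sketch, the cost structure behind the abstract's claims is visible: self-attention runs only over the handful of global tokens, and the bridge cross attention is linear in the number of pixels, which keeps the transformer side cheap.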