Recently, Transformer networks have achieved impressive results on a variety of vision tasks. However, most of them are computationally expensive and not suitable for real-world mobile applications. In this work, we present Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformers into mobile convolutional networks to leverage the benefits of both architectures. Different from recent works on vision transformers, the mobile transformer block in MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications: the Mobile Self-Attention (MoSA) module and the Mobile Feed Forward Network (MoFFN). MoSA simplifies the calculation of the attention map through a Branch Sharing scheme, while MoFFN serves as a mobile version of the MLP in the transformer, further reducing the computation by a large margin. Comprehensive experiments verify that our proposed MoCoViT family outperforms state-of-the-art portable CNNs and transformer architectures on various vision tasks. On ImageNet classification, it achieves 74.5% top-1 accuracy at 147M FLOPs, gaining 1.2% over MobileNetV3 with less computation. On the COCO object detection task, MoCoViT outperforms GhostNet by 2.1 AP in the RetinaNet framework.
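For illustration, below is a minimal PyTorch sketch of how the two modules could fit together in a mobile transformer block. The Branch Sharing idea is rendered here as a single projection reused for Q, K, and V in place of three separate linear layers, and MoFFN is stood in by a cheap pointwise expansion/reduction pair; the module names, the exact sharing pattern, and the FFN design (which the paper builds from Ghost modules) are simplifying assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class MoSA(nn.Module):
    """Sketch of Mobile Self-Attention with Branch Sharing: one shared
    projection is reused for Q, K, and V instead of three linear layers."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.shared_proj = nn.Linear(dim, dim)  # single shared branch
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.shared_proj(x).reshape(B, N, self.num_heads, C // self.num_heads)
        q = k = v = qkv.permute(0, 2, 1, 3)      # Q, K, V share one branch
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)

class MoFFN(nn.Module):
    """Stand-in for the mobile feed-forward network; a plain two-layer MLP
    is used here in place of the paper's Ghost-module-based design."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.ReLU(inplace=True),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):
        return self.net(x)

class MobileTransformerBlock(nn.Module):
    """Pre-norm transformer block combining the two sketches above."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.ffn = MoSA(dim, num_heads), MoFFN(dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))

tokens = torch.randn(1, 49, 96)                  # e.g. a 7x7 feature map, C=96
print(MobileTransformerBlock(96)(tokens).shape)  # torch.Size([1, 49, 96])
```

Note how the shared branch removes two of the three projection matrices of vanilla attention, which is where most of the claimed savings in the attention stage would come from under these assumptions.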