With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even compared to the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet while maintaining a similar size? We revisit the design choices of ViTs and propose an improved supernet with low latency and high parameter efficiency. We further introduce a fine-grained joint search strategy that finds efficient architectures by optimizing latency and the number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve about $4\%$ higher top-1 accuracy than MobileNetV2 and MobileNetV2$\times1.4$ on ImageNet-1K with similar latency and parameter counts. We demonstrate that properly designed and optimized vision transformers can achieve high performance with MobileNet-level size and speed.