Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency compared with convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e., $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}
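To make the element-wise formulation concrete, the sketch below shows one way a separable, $O(k)$ self-attention layer can be written in PyTorch: each token gets a scalar context score (softmax over the $k$ tokens), the scores weight a single global context vector, and that vector is broadcast back to every token with an element-wise product, so no $k \times k$ attention matrix is ever formed. The layer names, single-head structure, and projection sizes are illustrative assumptions for this sketch, not the released implementation.

\begin{verbatim}
# Minimal sketch of a separable self-attention layer (illustrative, not the
# authors' exact implementation). Input: (batch B, tokens k, dim d).
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)   # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, k, d)
        scores = torch.softmax(self.to_scores(x), dim=1)               # (B, k, 1)
        # Weighted sum over tokens -> one global context vector, O(k).
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)   # (B, 1, d)
        # Broadcast the context to every token via element-wise product;
        # no k x k attention matrix is computed.
        out = torch.relu(self.to_value(x)) * context                   # (B, k, d)
        return self.out_proj(out)

# Quick shape check: 2 samples, 256 tokens, 64-dim embeddings.
x = torch.randn(2, 256, 64)
y = SeparableSelfAttention(64)(x)
print(y.shape)  # torch.Size([2, 256, 64])
\end{verbatim}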