Light-weight convolutional neural networks (CNNs) are the de-facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
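To make the "transformers as convolutions" idea concrete, below is a minimal PyTorch sketch of how a MobileViT-style block could interleave convolutions and self-attention: convolutions encode local spatial information, the feature map is unfolded into non-overlapping patches so a small transformer can model global relationships, and the result is folded back and fused with the input via a convolution. The class name, patch size, depth, and channel widths are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a MobileViT-style block (assumed hyperparameters, not the
# paper's exact implementation): local conv encoding -> unfold into patches ->
# transformer for global mixing -> fold back -> conv fusion with the input.
import torch
import torch.nn as nn


class MobileViTStyleBlock(nn.Module):
    def __init__(self, channels: int, dim: int = 96, depth: int = 2, patch: int = 2):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv for spatial context, 1x1 to project
        # into the transformer dimension.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, dim, 1),
        )
        # Global representation: a small transformer encoder over unfolded patches.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=2 * dim,
            batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        # Fusion of the original input with the globally-processed features.
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        y = self.local(x)                                   # (B, dim, H, W)
        d = y.shape[1]
        # Unfold into (B * p * p, num_patches, dim): attention mixes information
        # across patches that share the same intra-patch position.
        y = y.reshape(b, d, h // p, p, w // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), d)
        y = self.transformer(y)
        # Fold the sequence back into a (B, dim, H, W) feature map.
        y = y.reshape(b, p, p, h // p, w // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(b, d, h, w)
        y = self.proj(y)
        # Concatenate with the input and fuse with a convolution.
        return self.fuse(torch.cat([x, y], dim=1))


if __name__ == "__main__":
    block = MobileViTStyleBlock(channels=64)
    out = block(torch.randn(1, 64, 32, 32))  # H and W must be divisible by patch
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

In this sketch the transformer only attends over patch positions rather than every pixel, which keeps the sequence length short, while the surrounding convolutions preserve the spatial inductive bias that makes light-weight CNNs parameter-efficient.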