MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create lightweight models for mobile vision tasks. Although the main MobileViTv1-block helps achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built from our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. These new models achieve better accuracy than MobileViTv2 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIoU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3
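
The abstract does not spell out the fusion rule; it only names the three feature streams (local, global and input) in the title. As a rough illustration, the sketch below assumes the fusion works per pixel: the local and global channel vectors are concatenated, mixed by a pointwise (1x1-convolution-style) linear map, and the input features are added back as a residual. The function names `pointwise_fuse` and `fuse_pixel`, the weight layout, and the fusion rule itself are assumptions made for this example, not the authors' implementation; the linked repository is the authoritative reference.

```python
from typing import List

Vector = List[float]


def pointwise_fuse(local: Vector, global_: Vector, weights: List[Vector]) -> Vector:
    """Mix concatenated local and global channels at one pixel.

    This is what a 1x1 convolution does at a single spatial location:
    a linear map over the channel dimension only (hypothetical layout:
    one weight row per output channel).
    """
    concat = local + global_  # channel-wise concatenation
    for row in weights:
        if len(row) != len(concat):
            raise ValueError("each weight row must match the concatenated channel count")
    return [sum(w * x for w, x in zip(row, concat)) for row in weights]


def fuse_pixel(inp: Vector, local: Vector, global_: Vector, weights: List[Vector]) -> Vector:
    """Assumed fusion: pointwise mix of local and global features plus an input residual."""
    fused = pointwise_fuse(local, global_, weights)
    if len(fused) != len(inp):
        raise ValueError("fused output must have as many channels as the input")
    return [f + x for f, x in zip(fused, inp)]


# Toy example with 2 input channels, 2 local channels and 2 global channels.
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],  # output channel 0 copies local channel 0
    [0.0, 0.0, 0.0, 1.0],  # output channel 1 copies global channel 1
]
result = fuse_pixel([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], toy_weights)
```

Under this assumption, the mixing step touches only the channel dimension, so its cost grows with the number of channels rather than with a spatial kernel size, and the residual keeps the input features available to later layers.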
