The vision transformer (ViT) has recently drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of small and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling the small models to reap the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, with increased image resolution, TinyViT reaches 86.5% accuracy, slightly better than Swin-L while using only 11% of its parameters. Last but not least, we demonstrate the good transfer ability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.
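The sparse-logit storage idea above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the helper names (`sparsify_logits`, `densify_logits`), the choice of top-k = 10, and the `.npz` file format are all assumptions made for the example. The point is that keeping only the top-k teacher logits per image shrinks storage roughly by `k / num_classes`, and the dense soft labels can be approximately rebuilt at training time by filling the missing entries with a large negative value so they vanish under softmax.

```python
import numpy as np

def sparsify_logits(logits, k=10):
    """Keep only the top-k teacher logits per sample.
    Hypothetical helper illustrating the sparse-storage idea."""
    idx = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k largest logits
    vals = np.take_along_axis(logits, idx, axis=-1)  # their values
    return idx, vals

def densify_logits(idx, vals, num_classes, fill=-1e4):
    """Rebuild approximate dense logits; entries outside the stored top-k
    get a large negative value, so softmax assigns them ~0 probability."""
    dense = np.full((idx.shape[0], num_classes), fill, dtype=np.float32)
    np.put_along_axis(dense, idx, vals.astype(np.float32), axis=-1)
    return dense

# Example: 2 samples, 1000 classes; storing top-10 keeps ~1% of the logits.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(2, 1000)).astype(np.float32)
idx, vals = sparsify_logits(teacher_logits, k=10)
np.savez("teacher_logits.npz", idx=idx, vals=vals)   # written once, reused every epoch

loaded = np.load("teacher_logits.npz")
dense = densify_logits(loaded["idx"], loaded["vals"], num_classes=1000)
```

At training time, `dense` would feed a standard soft-label distillation loss (e.g. KL divergence against the student's logits); because the discarded teacher probabilities are near zero anyway, the approximation costs little accuracy while avoiding re-running the teacher every epoch.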