Vision transformers (ViTs) have recently shown impressive empirical performance across a range of computer vision tasks and are viewed as an important type of foundation model. However, ViTs are typically very large, which severely hinders their deployment in many practical resource-constrained applications. Structured pruning is a promising way to address this problem by compressing the model and enabling practical efficiency gains. Yet, in contrast to its popularity for CNNs and RNNs, structured pruning for ViT models has been little explored. In this paper, we propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models. We first develop a graph-based ranking that measures the importance of attention heads; the extracted importance information is then integrated into an optimization-based procedure that imposes heterogeneous structured sparsity patterns on the ViT model. Experimental results show that GOHSP delivers strong compression performance. On the CIFAR-10 dataset, our approach reduces the parameters of the ViT-Small model by 40% with no accuracy loss. On the ImageNet dataset, at 30% and 35% sparsity for the DeiT-Tiny and DeiT-Small models, our approach achieves 1.65% and 0.76% higher accuracy than existing structured pruning methods, respectively.
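To make the idea of a graph-based ranking of attention heads concrete, the following is a minimal sketch under stated assumptions, not the GOHSP procedure itself. The abstract does not specify how the graph is built or scored. Here we assume each head is summarized by a feature vector (for example, a flattened average attention map), edges are weighted by absolute cosine similarity, and importance is a PageRank-style score. The names `rank_heads` and `heads_to_prune`, and the choice to treat low-scoring heads as pruning candidates, are hypothetical.

```python
import numpy as np


def rank_heads(head_features, damping=0.85, iters=100):
    """Score attention heads with a PageRank-style walk on a similarity graph.

    head_features: (num_heads, dim) array with one descriptor per head.
    The descriptor itself is an assumption of this sketch.
    """
    feats = np.asarray(head_features, dtype=float)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)

    # Edge weights: absolute cosine similarity between heads, no self-loops.
    sim = np.abs(unit @ unit.T)
    np.fill_diagonal(sim, 0.0)

    # Column-normalize into a transition matrix; isolated heads jump uniformly.
    n = sim.shape[0]
    col_sums = sim.sum(axis=0, keepdims=True)
    trans = np.where(col_sums > 0, sim / np.maximum(col_sums, 1e-12), 1.0 / n)

    # Damped power iteration; the scores stay a probability vector.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1.0 - damping) / n + damping * (trans @ scores)
    return scores / scores.sum()


def heads_to_prune(scores, sparsity):
    """Return the indices of the lowest-scoring heads for a target head sparsity."""
    k = int(round(sparsity * len(scores)))
    return sorted(np.argsort(scores)[:k].tolist())
```

In a pipeline like the one described above, such scores would only supply the importance signal. Deciding which structured sparsity pattern to impose in each layer is left to the optimization-based stage, which this sketch does not cover.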