Vision transformers (ViTs) have recently gained explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration of a unified approach to integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach extends seamlessly from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which significantly reduce computational cost while leaving generalization almost unimpaired. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28%, while enjoying 49.32% FLOPs and 4.40% running-time savings. Our codes are available at https://github.com/VITA-Group/SViTE.
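To make the "extract and train sparse subnetworks under a fixed parameter budget" idea concrete, below is a minimal sketch (not the authors' released implementation) of a single prune-and-grow update in the style of dynamic sparse training: the lowest-magnitude active weights are dropped and an equal number of inactive connections with the largest gradient magnitude are regrown, so the number of active parameters stays constant. The function name, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of one prune-and-grow step for a weight tensor;
# the real SViTE code (github.com/VITA-Group/SViTE) differs in detail.
import torch


def prune_and_grow(weight, grad, mask, update_fraction=0.1):
    """Swap a fraction of active connections: drop the smallest-magnitude
    active weights, regrow inactive ones with the largest gradient magnitude.
    The total number of active connections (the parameter budget) is unchanged."""
    n_active = int(mask.sum().item())
    n_update = max(1, int(update_fraction * n_active))

    # --- prune: remove the lowest-magnitude currently-active weights ---
    active_scores = torch.where(
        mask.bool(), weight.abs(), torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_scores.flatten(), n_update, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0

    # --- grow: activate inactive weights with the largest gradient magnitude ---
    inactive_scores = torch.where(
        new_mask.bool().view_as(grad),
        torch.full_like(grad, -float("inf")), grad.abs())
    grow_idx = torch.topk(inactive_scores.flatten(), n_update, largest=True).indices
    new_mask[grow_idx] = 1.0
    new_mask = new_mask.view_as(mask)

    # keep only surviving connections; newly grown weights start from zero
    new_weight = weight * mask * new_mask
    return new_weight, new_mask


# toy usage on a single layer's weight at ~50% density
w = torch.randn(64, 64)
g = torch.randn_like(w)                 # stand-in for a dense gradient estimate
m = (torch.rand_like(w) < 0.5).float()  # initial random sparse mask
w_new, m_new = prune_and_grow(w, g, m)
print(m_new.mean().item())              # density stays roughly 0.5
```

In the sketch, the fixed budget is enforced simply by dropping and regrowing the same number of connections each update; applying such updates periodically during training, rather than pruning a fully trained dense model afterwards, is what keeps both the training memory overhead and the inference cost low.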