Transformers yield state-of-the-art results across many tasks, but they still impose large computational costs during inference. We apply global structural pruning with latency-aware regularization to all parameters of the Vision Transformer (ViT) model to reduce inference latency. Furthermore, we analyze the pruned architectures and find consistent regularities in the final weight structure. These insights lead to a new architecture, NViT (Novel ViT), which redistributes parameters across the network. This architecture utilizes parameters more efficiently and enables control of the latency-accuracy trade-off. On ImageNet-1K, we prune the DEIT-Base (Touvron et al., 2021) model to a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup with only 0.07% loss in accuracy. We achieve more than 1% accuracy gain when compressing the base model to the throughput of the Small/Tiny variants. NViT gains 0.1-1.1% accuracy over the hand-designed DEIT family when trained from scratch, while being faster.
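To make the idea of latency-aware global pruning concrete, here is a minimal sketch, not the paper's implementation: prunable groups (e.g., attention heads or MLP channels) are ranked globally by importance per unit of latency saved, so low-importance groups that free up more latency are removed first. The function name `latency_aware_ranking` and all numeric values below are illustrative assumptions.

```python
import numpy as np

def latency_aware_ranking(saliency, latency_cost):
    # Score each prunable group by importance per unit latency cost.
    # Groups with low saliency but high latency savings get the
    # lowest scores and are therefore pruned first.
    score = np.asarray(saliency, dtype=float) / np.asarray(latency_cost, dtype=float)
    return np.argsort(score)  # ascending: earliest entries pruned first

# Hypothetical saliency and latency-cost values for four groups.
saliency = np.array([0.9, 0.1, 0.5, 0.05])
cost = np.array([1.0, 1.0, 2.0, 0.25])
order = latency_aware_ranking(saliency, cost)
```

In this sketch a purely parameter-count-based criterion would prune group 3 first (lowest saliency), whereas the latency-aware ratio instead prioritizes group 1, whose removal saves more latency per unit of importance lost.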